Preface


... beware of mathematicians, and all those who make empty prophecies. 


St. Augustine, De Genesi ad Litteram libri duodecim. 
Liber Secundus, 17, 37. 


Prediction of individual sequences, the main theme of this book, has been studied in 
various fields, such as statistical decision theory, information theory, game theory, machine 
learning, and mathematical finance. Early appearances of the problem go back as far as 
the 1950s, with the pioneering work of Blackwell, Hannan, and others. Even though the 
focus of investigation varied across these fields, some of the main principles have been 
discovered independently. Evolution of ideas remained parallel for quite some time. As 
each community developed its own vocabulary, communication became difficult. By the 
mid-1990s, however, it became clear that researchers of the different fields had a lot to 
teach each other. 

When we decided to write this book, in 2001, one of our main purposes was to investigate 
these connections and help ideas circulate more fluently. In retrospect, we now realize that 
the interplay among these many fields is far richer than we suspected. For this reason, 
exploring this beautiful subject during the preparation of the book became a most exciting 
experience — we really hope to succeed in transmitting this excitement to the reader. Today, 
several hundreds of pages later, we still feel there remains a lot to discover. This book just 
shows the first steps of some largely unexplored paths. We invite the reader to join us in 
finding out where these paths lead and where they connect. 

The book should by no means be treated as an encyclopedia of the subject. The selection 
of the material reflects our personal taste. Large parts of the manuscript have been read 
and constructively criticized by Gilles Stoltz, whose generous help we greatly appreci- 
ate. Claudio Gentile and Andras György also helped us with many important and useful 
comments. We are equally grateful to the members of a seminar group in Budapest who 
periodically gathered in Laci Györfi’s office at the Technical University and unmercifully 
questioned every line of the manuscript they had access to, and to György Ottucsak who 
diligently communicated to us all these questions and remarks. The members of the group 
are Andras Antos, Balazs Csanád Csaji, Laci Györfi, Andras György, Levente Kocsis, 
György Ottucsak, Marti Pintér, Csaba Szepesvari, and Kati Varga. Of course, all remaining 
errors are our responsibility. 

We thank all our friends and colleagues who, often without realizing it, taught us many 
ideas, tricks, and points of view that helped us understand the subtleties of the material. 
by the work of Alighiero Boetti (1940-1994).

Finally, a few grateful words to our families. Nicolo thanks Betta, Silvia, and Luca for 
filling his life with love, joy, and meaning. Gabor wants to thank Arrate for making the 
last ten years a pleasant adventure and Dani for so much fun, and to welcome Frida to the 
family. 

Have as much fun reading this book as we had writing it! 


Nicolò Cesa-Bianchi, Milano 
Introduction


1.1 Prediction 


Prediction, as we understand it in this book, is concerned with guessing the short-term evo- 
lution of certain phenomena. Examples of prediction problems are forecasting tomorrow’s 
temperature at a given location or guessing which asset will achieve the best performance 
over the next month. Despite their different nature, these tasks look similar at an abstract 
level: one must predict the next element of an unknown sequence given some knowledge 
about the past elements and possibly other available information. In this book we develop 
a formal theory of this general prediction problem. To properly address the diversity of 
potential applications without sacrificing mathematical rigor, the theory will be able to 
accommodate different formalizations of the entities involved in a forecasting task, such as 
the elements forming the sequence, the criterion used to measure the quality of a forecast, 
the protocol specifying how the predictor receives feedback about the sequence, and any 
possible side information provided to the predictor. 

In the most basic version of the sequential prediction problem, the predictor (or forecaster)
observes one after another the elements of a sequence $y_1, y_2, \ldots$ of symbols. At
each time $t = 1, 2, \ldots$, before the $t$th symbol of the sequence is revealed, the forecaster
guesses its value $y_t$ on the basis of the previous $t - 1$ observations.

In the classical statistical theory of sequential prediction, the sequence of elements, 
which we call outcomes, is assumed to be a realization of a stationary stochastic process. 
Under this hypothesis, statistical properties of the process may be estimated on the basis 
of the sequence of past observations, and effective prediction rules can be derived from 
these estimates. In such a setup, the risk of a prediction rule may be defined as the expected 
value of some loss function measuring the discrepancy between predicted value and true 
outcome, and different rules are compared based on the behavior of their risk. 

This book looks at prediction from a quite different angle. We abandon the basic assump- 
tion that the outcomes are generated by an underlying stochastic process and view the 
sequence $y_1, y_2, \ldots$ as the product of some unknown and unspecified mechanism (which
could be deterministic, stochastic, or even adversarially adaptive to our own behavior). To 
contrast it with stochastic modeling, this approach has often been referred to as prediction 
of individual sequences. 

Without a probabilistic model, the notion of risk cannot be defined, and it is not imme- 
diately obvious how the goals of prediction should be set up formally. Indeed, several 
possibilities exist, many of which are discussed in this book. In our basic model, the per- 
formance of the forecaster is measured by the loss accumulated during many rounds of 
prediction, where loss is scored by some fixed loss function. Since we want to avoid any
assumption on the way the sequence to be predicted is generated, there is no obvious base- 
line against which to measure the forecaster’s performance. To provide such a baseline, 
we introduce a class of reference forecasters, also called experts. These experts make their 
prediction available to the forecaster before the next outcome is revealed. The forecaster 
can then make his own prediction depend on the experts’ “advice” in order to keep his 
cumulative loss close to that of the best reference forecaster in the class. 

The difference between the forecaster’s accumulated loss and that of an expert is called
regret, as it measures how much the forecaster regrets, in hindsight, not having followed
the advice of this particular expert. Regret is a basic notion of this book, and a lot of
attention is paid to constructing forecasting strategies that guarantee a small regret with
respect to all experts in the class. As it turns out, the possibility of keeping the regrets small 
depends largely on the size and structure of the class of experts, and on the loss function. 
This model of prediction using expert advice is defined formally in Chapter 2 and serves 
as a basis for a large part of the book. 

The abstract notion of an “expert” can be interpreted in different ways, also depending on 
the specific application that is being considered. In some cases it is possible to view an expert 
as a black box of unknown computational power, possibly with access to private sources 
of side information. In other applications, the class of experts is collectively regarded as a 
statistical model, where each expert in the class represents an optimal forecaster for some 
given “state of nature.” With respect to this last interpretation, the goal of minimizing regret 
on arbitrary sequences may be thought of as a robustness requirement. Indeed, a small 
regret guarantees that, even when the model does not describe perfectly the state of nature, 
the forecaster does almost as well as the best element in the model fitted to the particular 
sequence of outcomes. In Chapters 2 and 3 we explore the basic possibilities and limitations 
of forecasters in this framework. 

Models of prediction of individual sequences arose in disparate areas motivated by 
problems as different as playing repeated games, compressing data, or gambling. Because 
of this diversity, it is not easy to trace back the first appearance of such a study. But 
it is now recognized that Blackwell, Hannan, Robbins, and the others who, as early as
the 1950s, studied the so-called sequential compound decision problem were the pioneering
contributors in the field. Indeed, many of the basic ideas appear in these early
works, including the use of randomization as a powerful tool for achieving a small regret
when it would otherwise be impossible. The model of randomized prediction is intro- 
duced in Chapter 4. In Chapter 6 several variants of the basic problem of randomized 
prediction are considered in which the information available to the forecaster is limited in 
some way. 

Another area in which prediction of individual sequences appeared naturally and found 
numerous applications is information theory. The influential work of Cover, Davisson, 
Lempel, Rissanen, Shtarkov, Ziv, and others gave the information-theoretic foundations 
of sequential prediction, first motivated by applications for data compression and “uni- 
versal” coding, and later extended to models of sequential gambling and investment. This 
theory mostly concentrates on a particular loss function, the so-called logarithmic or self- 
information loss, as it has a natural interpretation in the framework of sequential probability 
assignment. In this version of the prediction problem, studied in Chapters 9 and 10, at each 
time instance the forecaster determines a probability distribution over the set of possible 
outcomes. The total likelihood assigned to the entire sequence of outcomes is then used to 
score the forecaster. Sequential probability assignment has been studied in different closely
related models in statistics, including Bayesian frameworks and the problem of calibration
in various forms. Dawid’s “prequential” statistics is also close in spirit to some of the 
problems discussed here. 

In computer science, algorithms that receive their input sequentially are said to operate 
in an online modality. Typical application areas of online algorithms include tasks that 
involve sequences of decisions, like when one chooses how to serve each incoming request 
in a stream. The similarity between decision problems and prediction problems, and the
fact that online algorithms are typically analyzed on arbitrary sequences of inputs, have
resulted in a fruitful exchange of ideas and techniques between the two fields. However,
some crucial features of sequential decision problems that are missing in the prediction
framework (like the presence of states to model the interaction between the decision maker
and the mechanism generating the stream of requests) have so far prevented the derivation
of a general theory allowing a unified analysis of both types of problems.


1.2 Learning 


Prediction of individual sequences has also been a main topic of research in the theory 
of machine learning, more concretely in the area of online learning. In fact, in the late
1980s and early 1990s the paradigm of prediction with expert advice was first introduced as a
model of online learning in the pioneering papers of De Santis, Markowski, and Wegman; 
Littlestone and Warmuth; and Vovk, and it has been intensively investigated ever since. An 
interesting extension of the model allows the forecaster to consider other information apart 
from the past outcomes of the sequence to be predicted. By considering side information 
taking values in a vector space, and experts that are linear functions of the side information 
vector, one obtains classical models of online pattern recognition. For example, Rosenblatt’s 
Perceptron algorithm, the Widrow-Hoff rule, and ridge regression can be naturally cast in 
this framework. Chapters 11 and 12 are devoted to the study of such online learning 
algorithms. 

Researchers in machine learning and information theory have also been interested in the 
computational aspects of prediction. This becomes a particularly important problem when 
very large classes of reference forecasters are considered, and various tricks need to be 
invented to make predictors feasible for practical applications. Chapter 5 gathers some of 
these basic tricks illustrated on a few prototypical examples. 


1.3 Games 


The online prediction model studied in this book has an intimate connection with game 
theory. First of all, the model is most naturally defined in terms of a repeated game 
played between the forecaster and the “environment” generating the outcome sequence, 
thus offering a convenient way of describing variants of the basic theme. However, the 
connection is much deeper. For example, in Chapter 7 we show that classical minimax 
theorems of game theory can be recovered as simple applications of some basic bounds for 
the performance of sequential prediction algorithms. On the other hand, certain generalized
minimax theorems, most notably Blackwell’s approachability theorem, can be used to define
forecasters with good performance on individual sequences. 

Perhaps surprisingly, the connection goes even deeper. It turns out that if all players in a
repeated normal form game play according to certain simple regret-minimizing prediction
strategies, then the induced dynamics leads to equilibrium in a certain sense. This interesting
line of research has been gaining ground in game theory, based on the pioneering work of
Foster, Vohra, Hart, Mas-Colell, and others. In Chapter 7 we discuss the possibilities and
limitations of strategies based on regret-minimizing forecasting algorithms that lead to
various notions of equilibria.


1.4 A Gentle Start 


To introduce the reader to the spirit of the results contained in this book, we now describe 
in detail a simple example of a forecasting procedure and then analyze its performance on 
an arbitrary sequence of outcomes. 

Consider the problem of predicting an unknown sequence $y_1, y_2, \ldots$ of bits $y_t \in \{0, 1\}$.
At each time $t$ the forecaster first makes his guess $\hat{p}_t \in \{0, 1\}$ for $y_t$. Then the true bit $y_t$ is
revealed and the forecaster finds out whether his prediction was correct. To compute $\hat{p}_t$ the
forecaster listens to the advice of $N$ experts. This advice takes the form of a binary vector
$(f_{1,t}, \ldots, f_{N,t})$, where $f_{i,t} \in \{0, 1\}$ is the prediction that expert $i$ makes for the next bit $y_t$.
Our goal is to bound the number of time steps $t$ in which $\hat{p}_t \neq y_t$, that is, to bound the
number of mistakes made by the forecaster.

To start with an even simpler case, assume we are told in advance that, on this particular
sequence of outcomes, there is some expert $i$ that makes no mistakes. That is, we know
that $f_{i,t} = y_t$ for some $i$ and for all $t$, but we do not know for which $i$ this holds. Using
this information, it is not hard to devise a forecasting strategy that makes at most $\lfloor \log_2 N \rfloor$
mistakes on the sequence. To see this, consider the forecaster that starts by assigning a
weight $w_j = 1$ to each expert $j = 1, \ldots, N$. At every time step $t$, the forecaster predicts
with $\hat{p}_t = 1$ if and only if the number of experts $j$ with $w_j = 1$ and such that $f_{j,t} = 1$
is bigger than the number of those with $w_j = 1$ and such that $f_{j,t} = 0$. After $y_t$ is revealed, if $\hat{p}_t \neq y_t$,
then the forecaster performs the assignment $w_k \leftarrow 0$ on the weight of all experts $k$ such
that $f_{k,t} \neq y_t$. In words, this forecaster keeps track of which experts make a mistake and
predicts according to the majority of the experts that have been always correct.

The analysis is immediate. Let $W_m$ be the sum of the weights of all experts after the
forecaster has made $m$ mistakes. Initially, $m = 0$ and $W_0 = N$. When the forecaster makes
his $m$th mistake, at least half of the experts that have been always correct so far make their
first mistake. This implies that $W_m \le W_{m-1}/2$, since those experts that were incorrect for
the first time have their weight zeroed by the forecaster. Since the above inequality holds
for all $m \ge 1$, we have $W_m \le W_0/2^m$. Recalling that expert $i$ never makes a mistake, we
know that $w_i = 1$, which implies that $W_m \ge 1$. Using this together with $W_0 = N$, we thus
find that $1 \le N/2^m$. Solving for $m$ (which must be an integer) gives the claimed inequality
$m \le \lfloor \log_2 N \rfloor$.
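
The strategy just analyzed can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `halving_forecaster`, the tie-breaking rule (predict 0 on ties), and the synthetic data are assumptions made here for the example.

```python
import math
import random


def halving_forecaster(expert_advice, outcomes):
    """Majority vote over the experts whose weight is still 1.

    expert_advice[t][j] is the bit predicted by expert j at round t and
    outcomes[t] is the true bit. Returns the number of forecaster mistakes.
    """
    n_experts = len(expert_advice[0])
    weights = [1] * n_experts  # w_j = 1 for every expert at the start
    mistakes = 0
    for advice, y in zip(expert_advice, outcomes):
        votes_for_one = sum(w for w, f in zip(weights, advice) if f == 1)
        votes_for_zero = sum(w for w, f in zip(weights, advice) if f == 0)
        prediction = 1 if votes_for_one > votes_for_zero else 0  # ties -> 0 (assumption)
        if prediction != y:
            mistakes += 1
            # Zero the weight of every expert that was wrong in this round.
            weights = [0 if f != y else w for w, f in zip(weights, advice)]
    return mistakes


# Synthetic check: expert 0 is always right, the other experts guess at random.
rng = random.Random(0)
n_experts, n_rounds = 16, 200
outcomes = [rng.randint(0, 1) for _ in range(n_rounds)]
expert_advice = [
    [y] + [rng.randint(0, 1) for _ in range(n_experts - 1)] for y in outcomes
]
m = halving_forecaster(expert_advice, outcomes)
print(m, "mistakes; bound:", math.floor(math.log2(n_experts)))
```

Whatever the random draws, the printed number of mistakes cannot exceed the bound, because each mistake removes at least half of the surviving weight while the perfect expert keeps weight 1.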

We now move on to analyze the general case, in which the forecaster does not have any 
preliminary information on the number of mistakes the experts will make on the sequence. 
Our goal now is to relate the number of mistakes made by the forecaster to the number of 
mistakes made by the best expert, irrespective of which sequence is being predicted. 

Looking back at the previous forecasting strategy, it is clear that setting the weight of 
an incorrect expert to zero makes sense only if we are sure that some expert will never 
make a mistake. Without this guarantee, a safer choice could be performing the assignment
$w_k \leftarrow \beta w_k$ every time expert $k$ makes a mistake, where $0 < \beta < 1$ is a free parameter. In
other words, every time an expert is incorrect, instead of zeroing its weight we shrink it by a
constant factor. This is the only modification we make to the old forecaster, and this makes
its analysis almost as easy as the previous one. More precisely, the new forecaster compares
the total weight of the experts that recommend predicting 1 with those that recommend 0
and predicts according to the weighted majority. As before, at the time the forecaster makes
his $m$th mistake, the overall weight of the incorrect experts must be at least $W_{m-1}/2$. The
weight of these experts is then multiplied by $\beta$, and the weight of the other experts, which
is at most $W_{m-1}/2$, is left unchanged. Hence, we have $W_m \le W_{m-1}/2 + \beta W_{m-1}/2$. As
this holds for all $m \ge 1$, we get $W_m \le W_0 (1 + \beta)^m / 2^m$. Now let $k$ be the expert that has
made the fewest mistakes when the forecaster made his $m$th mistake. Denote this minimal
number of mistakes by $m^*$. Then the current weight of this expert is $w_k = \beta^{m^*}$, and thus we
have $W_m \ge \beta^{m^*}$. This provides the inequality $\beta^{m^*} \le W_0 (1 + \beta)^m / 2^m$. Using this, together
with $W_0 = N$, we get the final bound


\[
m \le \frac{\log_2 N + m^* \log_2(1/\beta)}{\log_2 \frac{2}{1+\beta}}.
\]


For any fixed value of $\beta$, this inequality establishes a linear dependence between the
mistakes made by the forecaster, after any number of predictions, and the mistakes made 
by the expert that is the best after that same number of predictions. Note that this bound 
holds irrespective of the choice of the sequence of outcomes. 
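
The weighted majority strategy and its mistake bound can be tried out with the following sketch. The names `weighted_majority` and `mistake_bound`, the tie-breaking rule, and the synthetic experts are assumptions made for illustration. Weights are shrunk after every round in which an expert errs, as described above.

```python
import math
import random


def weighted_majority(expert_advice, outcomes, beta):
    """Weighted majority vote; a wrong expert's weight is multiplied by beta.

    Returns (forecaster_mistakes, mistakes_of_best_expert).
    """
    n_experts = len(expert_advice[0])
    weights = [1.0] * n_experts
    expert_mistakes = [0] * n_experts
    mistakes = 0
    for advice, y in zip(expert_advice, outcomes):
        weight_for_one = sum(w for w, f in zip(weights, advice) if f == 1)
        weight_for_zero = sum(w for w, f in zip(weights, advice) if f == 0)
        prediction = 1 if weight_for_one > weight_for_zero else 0  # ties -> 0 (assumption)
        if prediction != y:
            mistakes += 1
        for k, f in enumerate(advice):
            if f != y:
                weights[k] *= beta       # shrink instead of zeroing
                expert_mistakes[k] += 1
    return mistakes, min(expert_mistakes)


def mistake_bound(n_experts, best_mistakes, beta):
    """Right-hand side of the bound on m derived above."""
    numerator = math.log2(n_experts) + best_mistakes * math.log2(1 / beta)
    return numerator / math.log2(2 / (1 + beta))


# Synthetic data: expert 0 errs about 10% of the time, the others guess at random.
rng = random.Random(0)
n_experts, n_rounds, beta = 10, 200, 0.5
outcomes = [rng.randint(0, 1) for _ in range(n_rounds)]
advice = []
for y in outcomes:
    good = y if rng.random() > 0.1 else 1 - y
    advice.append([good] + [rng.randint(0, 1) for _ in range(n_experts - 1)])
m, m_star = weighted_majority(advice, outcomes, beta)
print(m, m_star, mistake_bound(n_experts, m_star, beta))
```

With $\beta = 1/2$ the denominator is $\log_2(4/3) \approx 0.415$, so doubling the number of experts raises the bound by about 2.4 mistakes. For very long sequences the weights can underflow in floating point; a practical variant would renormalize them, which leaves the vote unchanged.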

The fact that m and m* are linearly related means that, in some sense, the performance 
of this forecaster gracefully degrades as a function of the “misfit” m* between the experts 
and the outcome sequence. The bound also exhibits a mild dependence on the number of 
experts: the $\log_2 N$ term implies that, apart from computational considerations, doubling
the number of experts causes the bound to increase by a small additive term. 

Notwithstanding its simplicity, this example contains some of the main themes developed 
in the book, such as the idea of computing predictions using weights that are functions of 
the experts’ past performance. In the subsequent chapters we develop this and many other 
ideas in a rigorous and systematic manner with the intent of offering a comprehensive view 
on the many facets of this fascinating subject. 


1.5 A Note to the Reader 


The book is addressed to researchers and students of computer science, mathematics, 
engineering, and economics who are interested in various aspects of prediction and learning. 
Even though we tried to make the text as self-contained as possible, the reader is assumed 
to be comfortable with some basic notions of probability, analysis, and linear algebra. To 
help the reader, we collect in the Appendix some technical tools used in the book. Some of 
this material is quite standard but may not be well known to all potential readers. 

In order to minimize interruptions in the flow of the text, we gathered bibliographical 
references at the end of each chapter. In these references we intend to trace back the origin 
of the results described in the text and point to some relevant literature. We apologize for any 
possible omissions. Some of the material is published here for the first time. These results 
are not flagged. Each chapter is concluded with a list of exercises whose level of difficulty
varies between distant extremes. Some of the exercises can be solved by an easy adaptation
of the material described in the main text. These should help the reader in mastering the
material. Some others summarize difficult research results. In some cases we offer guidance to
the solution, but there is no solution manual.

Figure 1.1 describes the dependence structure of the chapters of the book. This should 
help the reader to focus on specific topics and teachers to organize the material of various 
possible courses. 


Figure 1.1. The dependence structure of the chapters. [Diagram not reproduced. Its nodes are
the chapters: 1 Introduction; 2 Prediction with expert advice; 3 Tight bounds for specific
losses; 4 Randomized prediction; 5 Efficient forecasters for large classes of experts;
6 Prediction with limited feedback; 7 Prediction and playing games; 8 Absolute loss;
9 Logarithmic loss; 10 Sequential investment; 11 Linear pattern recognition; 12 Linear
classification.]
Prediction with Expert Advice


The model of prediction with expert advice, introduced in this chapter, provides the foun- 
dations to the theory of prediction of individual sequences that we develop in the rest of 
the book. 

Prediction with expert advice is based on the following protocol for sequential decisions:
the decision maker is a forecaster whose goal is to predict an unknown sequence
$y_1, y_2, \ldots$ of elements of an outcome space $\mathcal{Y}$. The forecaster’s predictions $\hat{p}_1, \hat{p}_2, \ldots$
belong to a decision space $\mathcal{D}$, which we assume to be a convex subset of a vector
space. In some special cases we take $\mathcal{D} = \mathcal{Y}$, but in general $\mathcal{D}$ may be different
from $\mathcal{Y}$.

The forecaster computes his predictions in a sequential fashion, and his predictive
performance is compared to that of a set of reference forecasters that we call experts.
More precisely, at each time $t$ the forecaster has access to the set $\{f_{E,t} : E \in \mathcal{E}\}$ of expert
predictions $f_{E,t} \in \mathcal{D}$, where $\mathcal{E}$ is a fixed set of indices for the experts. On the basis of the
experts’ predictions, the forecaster computes his own guess $\hat{p}_t$ for the next outcome $y_t$.
After $\hat{p}_t$ is computed, the true outcome $y_t$ is revealed.

The predictions of forecaster and experts are scored using a nonnegative loss function
$\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$.

This prediction protocol can be naturally viewed as the following repeated game between
“forecaster,” who makes guesses $\hat{p}_t$, and “environment,” who chooses the expert advice
$\{f_{E,t} : E \in \mathcal{E}\}$ and sets the true outcomes $y_t$.


PREDICTION WITH EXPERT ADVICE 


Parameters: decision space $\mathcal{D}$, outcome space $\mathcal{Y}$, loss function $\ell$, set $\mathcal{E}$ of expert
indices.


For each round $t = 1, 2, \ldots$


(1) the environment chooses the next outcome $y_t$ and the expert advice
$\{f_{E,t} \in \mathcal{D} : E \in \mathcal{E}\}$; the expert advice is revealed to the forecaster;

(2) the forecaster chooses the prediction $\hat{p}_t \in \mathcal{D}$;

(3) the environment reveals the next outcome $y_t \in \mathcal{Y}$;

(4) the forecaster incurs loss $\ell(\hat{p}_t, y_t)$ and each expert $E$ incurs loss
$\ell(f_{E,t}, y_t)$.
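
The four steps of the protocol can be written as a short simulation loop. The sketch below is only an illustration: `run_protocol` and its callable arguments are hypothetical names, and the toy environment, forecaster, and absolute loss are assumptions chosen to keep the example small.

```python
def run_protocol(forecaster, environment, loss, n_rounds):
    """Play n_rounds of prediction with expert advice.

    environment(t) returns (y_t, advice_t), where advice_t is the list of
    expert predictions. forecaster(advice_t, history) returns the prediction,
    with history holding the (advice, outcome) pairs of earlier rounds.
    Returns the forecaster's losses and the per-expert losses, round by round.
    """
    history = []
    forecaster_losses, expert_losses = [], []
    for t in range(1, n_rounds + 1):
        y, advice = environment(t)            # step (1): y_t is chosen but kept hidden
        p = forecaster(advice, history)       # step (2): only advice and the past are visible
        history.append((advice, y))           # step (3): y_t is revealed
        forecaster_losses.append(loss(p, y))  # step (4): losses are incurred
        expert_losses.append([loss(f, y) for f in advice])
    return forecaster_losses, expert_losses


def absolute_loss(p, y):
    return abs(p - y)


def alternating_environment(t):
    """Outcomes alternate 1, 0, 1, 0, ...; three experts give constant advice."""
    return t % 2, [0.0, 1.0, 0.5]


def uniform_average(advice, history):
    """A forecaster that ignores the past and averages the advice."""
    return sum(advice) / len(advice)


losses, expert_losses = run_protocol(
    uniform_average, alternating_environment, absolute_loss, 4
)
print(losses)
print(expert_losses)
```

Because the decision space here is the interval $[0, 1]$, which is convex, the averaged advice is always a legal prediction.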

The forecaster’s goal is to keep as small as possible the cumulative regret (or simply
regret) with respect to each expert. This quantity is defined, for expert $E$, by the
sum

\[
R_{E,n} = \sum_{t=1}^{n} \bigl( \ell(\hat{p}_t, y_t) - \ell(f_{E,t}, y_t) \bigr) = \hat{L}_n - L_{E,n},
\]

where we use $\hat{L}_n = \sum_{t=1}^{n} \ell(\hat{p}_t, y_t)$ to denote the forecaster’s cumulative loss and
$L_{E,n} = \sum_{t=1}^{n} \ell(f_{E,t}, y_t)$ to denote the cumulative loss of expert $E$. Hence, $R_{E,n}$ is the difference
between the forecaster’s total loss and that of expert $E$ after $n$ prediction rounds. We also
define the instantaneous regret with respect to expert $E$ at time $t$ by
$r_{E,t} = \ell(\hat{p}_t, y_t) - \ell(f_{E,t}, y_t)$, so that $R_{E,n} = \sum_{t=1}^{n} r_{E,t}$. One may think about $r_{E,t}$ as the regret the forecaster
feels for not having listened to the advice of expert $E$ right after the $t$th outcome $y_t$ has been
revealed.
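
A small numerical illustration of these definitions follows. The function names and the loss values are made up for the example; the losses are assumed to have been recorded round by round.

```python
def instantaneous_regrets(forecaster_losses, expert_losses):
    """r_{E,t}: forecaster's loss at round t minus expert E's loss at round t."""
    return [
        [fl - el for el in row]
        for fl, row in zip(forecaster_losses, expert_losses)
    ]


def cumulative_regrets(forecaster_losses, expert_losses):
    """R_{E,n} for every expert E, obtained by summing the instantaneous regrets."""
    per_round = instantaneous_regrets(forecaster_losses, expert_losses)
    n_experts = len(expert_losses[0])
    return [sum(r[e] for r in per_round) for e in range(n_experts)]


# Four rounds, two experts; expert_losses[t][e] is the loss of expert e at round t.
forecaster_losses = [1, 0, 1, 1]
expert_losses = [[0, 1], [0, 1], [1, 0], [0, 1]]
regret = cumulative_regrets(forecaster_losses, expert_losses)
print(regret)  # [2, 0]
```

Here the forecaster's cumulative loss is 3, while the two experts accumulate losses 1 and 3, so the regrets are 2 and 0, matching the difference of cumulative losses.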

Throughout the rest of this chapter we assume that the number of experts is finite, 
E = {1,2,..., N}, and use the index i = 1,...,N to refer to an expert. The goal of 
the forecaster is to predict so that the regret is as small as possible for all sequences of 
outcomes. For example, the forecaster may want to have a vanishing per-round regret, that is, 
to achieve 


\[
\max_{i=1,\ldots,N} R_{i,n} = o(n) \quad \text{or, equivalently,} \quad \frac{1}{n}\Bigl( \hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \Bigr) \xrightarrow{\,n \to \infty\,} 0,
\]


where the convergence is uniform over the choice of the outcome sequence and the choice 
of the expert advice. In the next section we show that this ambitious goal may be achieved 
by a simple forecaster under mild conditions. 

The rest of the chapter is structured as follows. In Section 2.1 we introduce the important 
class of weighted average forecasters, describe the subclass of potential-based forecasters, 
and analyze two important special cases: the polynomially weighted average forecaster 
and the exponentially weighted average forecaster. This latter forecaster is quite cen- 
tral in our theory, and the following four sections are all concerned with various issues 
related to it: Section 2.2 shows certain optimality properties, Section 2.3 addresses the 
problem of tuning dynamically the parameter of the potential, Section 2.4 investigates 
the problem of obtaining improved regret bounds when the loss of the best expert is 
small, and Section 2.5 investigates the special case of differentiable loss functions. Starting 
with Section 2.6, we discover the advantages of rescaling the loss function. This sim- 
ple trick allows us to derive new and even sharper performance bounds. In Section 2.7 
we introduce and analyze a weighted average forecaster for rescaled losses that, unlike 
the previous ones, is not based on the notion of potential. In Section 2.8 we return to 
the exponentially weighted average forecaster and derive improved regret bounds based on 
rescaling the loss function. Sections 2.9 and 2.10 address some general issues in the prob- 
lem of prediction with expert advice, including the definition of minimax values. Finally, 
in Section 2.11 we discuss a variant of the notion of regret where discount factors are 
introduced. 
2.1 Weighted Average Prediction


A natural forecasting strategy in this framework is based on computing a weighted average 
of experts’ predictions. That is, the forecaster predicts at time t according to 


\[
\hat{p}_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{j=1}^{N} w_{j,t-1}},
\]

where $w_{1,t-1}, \ldots, w_{N,t-1} \ge 0$ are the weights assigned to the experts at time $t$. Note that
$\hat{p}_t \in \mathcal{D}$, since it is a convex combination of the expert advice $f_{1,t}, \ldots, f_{N,t} \in \mathcal{D}$ and $\mathcal{D}$
is convex by our assumptions. As our goal is to minimize the regret, it is reasonable to
choose the weights according to the regret up to time $t - 1$. If $R_{i,t-1}$ is large, then we
assign a large weight $w_{i,t-1}$ to expert $i$, and vice versa. As $R_{i,t-1} = \hat{L}_{t-1} - L_{i,t-1}$, this
results in weighting more those experts $i$ whose cumulative loss $L_{i,t-1}$ is small. Hence, we
view the weight as an arbitrary increasing function of the expert’s regret. For reasons that
will become apparent shortly, we find it convenient to write this function as the derivative
of a nonnegative, convex, and increasing function $\phi : \mathbb{R} \to \mathbb{R}$. We write $\phi'$ to denote this
derivative. The forecaster uses $\phi'$ to determine the weight $w_{i,t-1} = \phi'(R_{i,t-1})$ assigned to
the $i$th expert. Therefore, the prediction $\hat{p}_t$ at time $t$ of the weighted average forecaster is
defined by

\[
\hat{p}_t = \frac{\sum_{i=1}^{N} \phi'(R_{i,t-1}) f_{i,t}}{\sum_{j=1}^{N} \phi'(R_{j,t-1})}
\qquad \text{(weighted average forecaster).}
\]

Note that this is a legitimate forecaster as $\hat{p}_t$ is computed on the basis of the experts’ advice
at time $t$ and the cumulative regrets up to time $t - 1$.
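
The weighted average forecaster can be sketched directly from this definition. In the code below the function names, the fallback used when all weights vanish, and the choice $\phi(x) = e^{\eta x}$ with $\eta = 2$ are assumptions made for illustration; the definition above leaves $\phi$ unspecified.

```python
import math


def weighted_average_prediction(advice, regrets, phi_prime):
    """Average the advice with weights phi'(R_{i,t-1})."""
    weights = [phi_prime(r) for r in regrets]
    total = sum(weights)
    if total == 0:
        # All weights vanish: fall back to the plain average (an assumption).
        return sum(advice) / len(advice)
    return sum(w * f for w, f in zip(weights, advice)) / total


def run_forecaster(advice_rounds, outcomes, loss, phi_prime):
    """Run the forecaster, updating the cumulative regrets after each round."""
    regrets = [0.0] * len(advice_rounds[0])  # R_{i,0} = 0
    predictions = []
    for advice, y in zip(advice_rounds, outcomes):
        p = weighted_average_prediction(advice, regrets, phi_prime)
        predictions.append(p)
        forecaster_loss = loss(p, y)
        regrets = [
            r + forecaster_loss - loss(f, y) for r, f in zip(regrets, advice)
        ]
    return predictions, regrets


ETA = 2.0  # hypothetical parameter of the illustrative choice phi(x) = exp(ETA * x)


def exp_phi_prime(x):
    return ETA * math.exp(ETA * x)


def squared_loss(p, y):
    return (p - y) ** 2


# Two constant experts (always 0, always 1) and an outcome sequence of all ones.
advice_rounds = [[0.0, 1.0]] * 5
outcomes = [1.0] * 5
predictions, final_regrets = run_forecaster(
    advice_rounds, outcomes, squared_loss, exp_phi_prime
)
print(predictions)
```

The first prediction is 0.5 because all regrets start at zero; afterwards the weight shifts toward the expert that keeps predicting 1, so the predictions move toward 1.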
We start the analysis of weighted average forecasters by a simple technical observation. 


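Lemma 2.1. If the loss function $\ell$ is convex in its first argument, then

\[
\sup_{y_t \in \mathcal{Y}} \sum_{i=1}^{N} r_{i,t}\, \phi'(R_{i,t-1}) \le 0.
\]

Proof. Using Jensen’s inequality, for all $y \in \mathcal{Y}$,

\[
\ell(\hat{p}_t, y) = \ell\!\left( \frac{\sum_{i=1}^{N} \phi'(R_{i,t-1}) f_{i,t}}{\sum_{j=1}^{N} \phi'(R_{j,t-1})},\, y \right)
\le \frac{\sum_{i=1}^{N} \phi'(R_{i,t-1})\, \ell(f_{i,t}, y)}{\sum_{j=1}^{N} \phi'(R_{j,t-1})}.
\]

Rearranging, we obtain the statement. $\blacksquare$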


The simple observation of the lemma above allows us to interpret the weighted average 
forecaster in an interesting way. To do this, introduce the instantaneous regret vector 


\[
\mathbf{r}_t = (r_{1,t}, \ldots, r_{N,t}) \in \mathbb{R}^N
\]

and the corresponding regret vector $\mathbf{R}_n = \sum_{t=1}^{n} \mathbf{r}_t$. It is convenient to introduce also a
potential function $\Phi : \mathbb{R}^N \to \mathbb{R}$ of the form

\[
\Phi(\mathbf{u}) = \psi\Bigl( \sum_{i=1}^{N} \phi(u_i) \Bigr) \qquad \text{(potential function),}
\]

where $\phi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, increasing, and twice differentiable function, and
$\psi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, strictly increasing, concave, and twice differentiable auxiliary
function.

Using the notion of potential function, we can give the following equivalent definition 
of the weighted average forecaster 


\[
\hat{p}_t = \frac{\sum_{i=1}^{N} \nabla\Phi(\mathbf{R}_{t-1})_i\, f_{i,t}}{\sum_{j=1}^{N} \nabla\Phi(\mathbf{R}_{t-1})_j},
\]

where $\nabla\Phi(\mathbf{R}_{t-1})_i = \partial\Phi(\mathbf{R}_{t-1}) / \partial R_{i,t-1}$. We say that a forecaster defined as above is based
on the potential $\Phi$. Even though the definition of the weighted average forecaster is independent
of the choice of $\psi$ (the derivatives $\psi'$ cancel in the definition of $\hat{p}_t$ above), the
proof of the main result of this chapter, Theorem 2.1, reveals that $\psi$ plays an important role
in the analysis. We remark that convexity of $\phi$ is not needed to prove Theorem 2.1, and this
is the reason why convexity is not mentioned in the above definition of potential function.
On the other hand, all forecasters in this book that are based on potential functions and
have a vanishing per-round regret are constructed using a convex $\phi$ (see also Exercise 2.2).
The statement of Lemma 2.1 is equivalent to

\[
\sup_{y_t \in \mathcal{Y}} \mathbf{r}_t \cdot \nabla\Phi(\mathbf{R}_{t-1}) \le 0 \qquad \text{(Blackwell condition).}
\]

The notation $\mathbf{u} \cdot \mathbf{v}$ stands for the inner product of two vectors defined by
$\mathbf{u} \cdot \mathbf{v} = u_1 v_1 + \cdots + u_N v_N$. We call the above inequality Blackwell condition because of its
similarity to a key property used in the proof of the celebrated Blackwell’s approachability
theorem. The theorem, and its connection to the above inequality, are explored in
Sections 7.7 and 7.8. Figure 2.1 shows an example of a prediction satisfying the Blackwell
condition.
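
The Blackwell condition, and the cancellation of $\psi'$ in the forecaster, can be checked numerically. The sketch below assumes, purely for illustration, the potential built from $\phi(x) = e^{\eta x}$ and $\psi(x) = (1/\eta)\ln x$ with a hypothetical $\eta = 1.5$, the squared loss on $[0, 1]$, and a grid of outcomes standing in for the supremum over $\mathcal{Y}$.

```python
import math
import random

ETA = 1.5  # hypothetical parameter of the illustrative potential


def phi(x):
    return math.exp(ETA * x)


def phi_prime(x):
    return ETA * math.exp(ETA * x)


def psi_prime(x):
    return 1.0 / (ETA * x)  # derivative of psi(x) = ln(x) / ETA


def potential_gradient(regrets):
    """Components psi'(sum_j phi(R_j)) * phi'(R_i) of the gradient of Phi."""
    s = sum(phi(r) for r in regrets)
    return [psi_prime(s) * phi_prime(r) for r in regrets]


def weighted_average(weights, advice):
    return sum(w * f for w, f in zip(weights, advice)) / sum(weights)


def squared_loss(p, y):
    return (p - y) ** 2


def blackwell_value(advice, regrets, y):
    """Inner product of r_t and the gradient of Phi at R_{t-1}, for outcome y."""
    grad = potential_gradient(regrets)
    p = weighted_average(grad, advice)
    r = [squared_loss(p, y) - squared_loss(f, y) for f in advice]
    return sum(ri * gi for ri, gi in zip(r, grad))


rng = random.Random(0)
regrets = [rng.uniform(-2, 2) for _ in range(5)]
advice = [rng.random() for _ in range(5)]

# Largest inner product over a grid of outcomes in [0, 1]; it should not be positive.
worst = max(blackwell_value(advice, regrets, k / 100) for k in range(101))
print("largest inner product over the grid:", worst)

# psi' cancels: weighting by the gradient or by phi' gives the same prediction.
p_grad = weighted_average(potential_gradient(regrets), advice)
p_phi = weighted_average([phi_prime(r) for r in regrets], advice)
print("difference between the two predictions:", abs(p_grad - p_phi))
```

The check works for any loss that is convex in its first argument; replacing `squared_loss` with a nonconvex loss may produce a positive value, which is the situation Lemma 2.1 excludes.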

The Blackwell condition shows that the function $\Phi$ plays a role vaguely similar to the
potential in a dynamical system: the weighted average forecaster, by forcing the regret
vector to point away from the gradient of $\Phi$ irrespective of the outcome $y_t$, tends to keep
the point $\mathbf{R}_t$ close to the minimum of $\Phi$. This property, in fact, suggests a simple analysis
because the increments of the potential function $\Phi$ may now be easily bounded by Taylor’s
theorem. The role of the function $\psi$ is simply to obtain better bounds with this argument.

The next theorem applies to any forecaster satisfying the Blackwell condition (and thus 
not only to weighted average forecasters). However, it will imply several interesting bounds 
for different versions of the weighted average forecaster. 


Theorem 2.1. Assume that a forecaster satisfies the Blackwell condition for a potential
$\Phi(u) = \psi\bigl(\sum_{i=1}^N \phi(u_i)\bigr)$. Then, for all $n = 1, 2, \ldots$,

\[
\Phi(R_n) \le \Phi(0) + \frac{1}{2}\sum_{t=1}^n C(r_t),
\]

where

\[
C(r_t) = \sup_{u \in \mathbb{R}^N} \psi'\Bigl(\sum_{i=1}^N \phi(u_i)\Bigr) \sum_{i=1}^N \phi''(u_i)\, r_{i,t}^2 .
\]

Figure 2.1. An illustration of the Blackwell condition with N = 2. The dashed line shows the
points in regret space with potential equal to 1. The prediction at time t changed the potential from
$\Phi(R_{t-1}) = 1$ to $\Phi(R_t) = \Phi(R_{t-1} + r_t)$. Though $\Phi(R_t) > \Phi(R_{t-1})$, the inner product between $r_t$ and
the gradient $\nabla\Phi(R_{t-1})$ is negative, and thus the Blackwell condition holds.

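Proof. We estimate $\Phi(R_t)$ in terms of $\Phi(R_{t-1})$ using Taylor's theorem. Thus, we obtain

\[
\begin{aligned}
\Phi(R_t) &= \Phi(R_{t-1} + r_t) \\
&= \Phi(R_{t-1}) + \nabla\Phi(R_{t-1}) \cdot r_t
   + \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \frac{\partial^2\Phi}{\partial u_i\,\partial u_j}\bigg|_{\xi} r_{i,t}\, r_{j,t}
   \qquad \text{(where $\xi$ is some vector in $\mathbb{R}^N$)} \\
&\le \Phi(R_{t-1}) + \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \frac{\partial^2\Phi}{\partial u_i\,\partial u_j}\bigg|_{\xi} r_{i,t}\, r_{j,t},
\end{aligned}
\]

where the inequality follows by the Blackwell condition. Now straightforward calculation
shows that

\[
\begin{aligned}
\sum_{i=1}^N\sum_{j=1}^N \frac{\partial^2\Phi}{\partial u_i\,\partial u_j}\bigg|_{\xi} r_{i,t}\, r_{j,t}
&= \psi''\Bigl(\sum_{i=1}^N \phi(\xi_i)\Bigr)\Bigl(\sum_{i=1}^N \phi'(\xi_i)\, r_{i,t}\Bigr)^2
   + \psi'\Bigl(\sum_{i=1}^N \phi(\xi_i)\Bigr)\sum_{i=1}^N \phi''(\xi_i)\, r_{i,t}^2 \\
&\le \psi'\Bigl(\sum_{i=1}^N \phi(\xi_i)\Bigr)\sum_{i=1}^N \phi''(\xi_i)\, r_{i,t}^2
   \qquad \text{(since $\psi$ is concave)} \\
&\le C(r_t),
\end{aligned}
\]

where at the last step we used the definition of $C(r_t)$. Thus, we have obtained
$\Phi(R_t) - \Phi(R_{t-1}) \le C(r_t)/2$. The proof is finished by summing this inequality for
$t = 1, \ldots, n$. ∎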


Theorem 2.1 can be used as follows. By monotonicity of $\psi$ and $\phi$,

\[
\psi\Bigl(\phi\Bigl(\max_{i=1,\ldots,N} R_{i,n}\Bigr)\Bigr)
= \psi\Bigl(\max_{i=1,\ldots,N} \phi(R_{i,n})\Bigr)
\le \psi\Bigl(\sum_{i=1}^N \phi(R_{i,n})\Bigr) = \Phi(R_n).
\]

Note that $\psi$ is invertible by the definition of the potential function. If $\phi$ is invertible as well,
then we get

\[
\max_{i=1,\ldots,N} R_{i,n} \le \phi^{-1}\bigl(\psi^{-1}(\Phi(R_n))\bigr),
\]

where $\Phi(R_n)$ is replaced with the bound provided by Theorem 2.1. In the first of the two
examples that follow, however, $\phi$ is not invertible, and thus $\max_{i=1,\ldots,N} R_{i,n}$ is directly
majorized using a function of the bound provided by Theorem 2.1.


Polynomially Weighted Average Forecaster 
Consider the polynomially weighted average forecaster based on the potential 


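\[
\Phi_p(u) = \Bigl(\sum_{i=1}^N (u_i)_+^p\Bigr)^{2/p} = \|u_+\|_p^2 \qquad \text{(polynomial potential),}
\]

where $p \ge 2$. Here $u_+$ denotes the vector of positive parts of the components of $u$. The
weights assigned to the experts are then given by

\[
w_{i,t-1} = \nabla\Phi_p(R_{t-1})_i = \frac{2\,(R_{i,t-1})_+^{p-1}}{\|(R_{t-1})_+\|_p^{p-2}},
\]

and the forecaster's predictions are just the weighted average of the experts' predictions

\[
\hat{p}_t = \frac{\sum_{i=1}^N \Bigl(\sum_{s=1}^{t-1} \bigl(\ell(\hat{p}_s, y_s) - \ell(f_{i,s}, y_s)\bigr)\Bigr)_+^{p-1} f_{i,t}}
                 {\sum_{j=1}^N \Bigl(\sum_{s=1}^{t-1} \bigl(\ell(\hat{p}_s, y_s) - \ell(f_{j,s}, y_s)\bigr)\Bigr)_+^{p-1}} .
\]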


Corollary 2.1. Assume that the loss function $\ell$ is convex in its first argument and that it takes
values in [0, 1]. Then, for any sequence $y_1, y_2, \ldots \in \mathcal{Y}$ of outcomes and for any $n \ge 1$, the
regret of the polynomially weighted average forecaster satisfies

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \sqrt{n(p-1)N^{2/p}} .
\]

This shows that, for all $p \ge 2$, the per-round regret converges to zero at a rate $O(1/\sqrt{n})$
uniformly over the outcome sequence and the expert advice. The choice p = 2 yields a
particularly simple algorithm. On the other hand, the choice $p = 2\ln N$ (for N > 2), which
approximately minimizes the upper bound, leads to

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \sqrt{n e\,(2\ln N - 1)},
\]


yielding a significantly better dependence on the number of experts N. 
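
A minimal sketch of the polynomially weighted average forecaster may help fix ideas. It assumes scalar predictions in [0, 1], binary outcomes, and the absolute loss as the default; the function names `polynomial_forecaster` and `corollary_2_1_bound` are hypothetical and introduced only for this illustration. When all regrets are nonpositive the weights vanish, and the sketch falls back to the plain average of the advice, which is one admissible choice among many.

```python
import math

def polynomial_forecaster(advice_seq, outcomes, p=2.0, loss=lambda q, y: abs(q - y)):
    """Run the polynomially weighted average forecaster.

    Returns (cumulative loss of the forecaster, list of cumulative expert losses).
    """
    n_experts = len(advice_seq[0])
    regrets = [0.0] * n_experts          # R_{i,t-1}
    expert_losses = [0.0] * n_experts    # L_{i,t}
    forecaster_loss = 0.0
    for advice, y in zip(advice_seq, outcomes):
        # Unnormalized weights (R_{i,t-1})_+^{p-1}; the common factor
        # 2 / ||(R_{t-1})_+||_p^{p-2} cancels in the weighted average.
        weights = [max(r, 0.0) ** (p - 1.0) for r in regrets]
        total = sum(weights)
        if total > 0.0:
            prediction = sum(w * f for w, f in zip(weights, advice)) / total
        else:
            # Assumption: the gradient is zero here, so any prediction
            # satisfies the Blackwell condition; we use the plain average.
            prediction = sum(advice) / n_experts
        incurred = loss(prediction, y)
        forecaster_loss += incurred
        for i, f in enumerate(advice):
            expert_loss = loss(f, y)
            expert_losses[i] += expert_loss
            regrets[i] += incurred - expert_loss
    return forecaster_loss, expert_losses

def corollary_2_1_bound(n, n_experts, p):
    """Right-hand side of Corollary 2.1: sqrt(n (p - 1) N^(2/p))."""
    return math.sqrt(n * (p - 1.0) * n_experts ** (2.0 / p))
```

Comparing `forecaster_loss - min(expert_losses)` with `corollary_2_1_bound(n, N, p)` for p = 2 and p = 2 ln N gives a quick sanity check of the corollary on any data set.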


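Proof of Corollary 2.1. Apply Theorem 2.1 using the polynomial potential. Then
$\phi(x) = x_+^p$ and $\psi(x) = x^{2/p}$, $x \ge 0$. Moreover

\[
\psi'(x) = \frac{2}{p\,x^{(p-2)/p}} \qquad\text{and}\qquad \phi''(x) = p(p-1)\,x_+^{p-2} .
\]

By Hölder's inequality,

\[
\begin{aligned}
\sum_{i=1}^N \phi''(u_i)\, r_{i,t}^2
&= p(p-1)\sum_{i=1}^N (u_i)_+^{p-2}\, r_{i,t}^2 \\
&\le p(p-1)\Bigl(\sum_{i=1}^N \bigl((u_i)_+^{p-2}\bigr)^{p/(p-2)}\Bigr)^{(p-2)/p}
     \Bigl(\sum_{i=1}^N |r_{i,t}|^p\Bigr)^{2/p} .
\end{aligned}
\]

Thus,

\[
\psi'\Bigl(\sum_{i=1}^N \phi(u_i)\Bigr)\sum_{i=1}^N \phi''(u_i)\, r_{i,t}^2
\le 2(p-1)\Bigl(\sum_{i=1}^N |r_{i,t}|^p\Bigr)^{2/p}
\]

and the conditions of Theorem 2.1 are satisfied with $C(r_t) \le 2(p-1)\|r_t\|_p^2$. Since $\Phi_p(0) = 0$,
Theorem 2.1, together with the boundedness of the loss function, implies that

\[
\Bigl(\sum_{i=1}^N (R_{i,n})_+^p\Bigr)^{2/p} = \Phi_p(R_n) \le (p-1)\sum_{t=1}^n \|r_t\|_p^2 \le n(p-1)N^{2/p} .
\]

Finally, since

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} = \max_{i=1,\ldots,N} R_{i,n} \le \Bigl(\sum_{i=1}^N (R_{i,n})_+^p\Bigr)^{1/p},
\]

the result follows. ∎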


Remark 2.1. We have defined the polynomial potential as $\Phi_p(u) = \|u_+\|_p^2$, which corresponds
to taking $\psi(x) = x^{2/p}$. Recall that $\psi$ does not have any influence on the prediction; it
only has a role in the analysis. The particular form analyzed here is chosen for convenience,
but there are other possibilities leading to similar results. For example, one may argue that
it is more natural to take $\psi(x) = x^{1/p}$, which leads to the potential function $\Phi(u) = \|u_+\|_p$.
We leave it as an exercise to work out a bound similar to that of Corollary 2.1 based on this
choice.

Figure 2.2. Plots of the polynomial potential function $\Phi_p(u)$ for N = 2 experts with exponents
p = 2 and p = 10.


Exponentially Weighted Average Forecaster 
Our second main example is the exponentially weighted average forecaster based on the
potential

\[
\Phi_\eta(u) = \frac{1}{\eta}\ln\Bigl(\sum_{i=1}^N e^{\eta u_i}\Bigr) \qquad \text{(exponential potential),}
\]

where $\eta$ is a positive parameter. In this case, the weights assigned to the experts are of the
form

\[
w_{i,t-1} = \nabla\Phi_\eta(R_{t-1})_i = \frac{e^{\eta R_{i,t-1}}}{\sum_{j=1}^N e^{\eta R_{j,t-1}}},
\]

and the weighted average forecaster simplifies to

\[
\hat{p}_t = \frac{\sum_{i=1}^N \exp\bigl(\eta(\hat{L}_{t-1} - L_{i,t-1})\bigr) f_{i,t}}
                 {\sum_{j=1}^N \exp\bigl(\eta(\hat{L}_{t-1} - L_{j,t-1})\bigr)}
          = \frac{\sum_{i=1}^N e^{-\eta L_{i,t-1}} f_{i,t}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}} .
\]

The beauty of the exponentially weighted average forecaster is that it only depends on the
past performance of the experts, whereas the predictions made using other general potentials
depend on the past predictions $\hat{p}_s$, $s < t$, as well. Furthermore, the weights that the forecaster
assigns to the experts are computable in a simple incremental way: let $w_{1,t-1}, \ldots, w_{N,t-1}$
be the weights used at round t to compute the prediction $\hat{p}_t = \sum_{i=1}^N w_{i,t-1} f_{i,t}$. Then, as
one can easily verify,

\[
w_{i,t} = \frac{w_{i,t-1}\, e^{-\eta \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^N w_{j,t-1}\, e^{-\eta \ell(f_{j,t},\, y_t)}} .
\]
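
The incremental update can be sketched in a few lines. This is an illustrative sketch only: the helper names `ewa_update` and `ewa_predict` are hypothetical, the weights are assumed to be normalized (they sum to one), and the advice is assumed to be scalar.

```python
import math

def ewa_update(weights, losses, eta):
    """One incremental step: w_i <- w_i * exp(-eta * loss_i), then renormalize."""
    unnormalized = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def ewa_predict(weights, advice):
    """Prediction as the weighted average of the expert advice (weights sum to one)."""
    return sum(w * f for w, f in zip(weights, advice))
```

Starting from uniform weights and applying `ewa_update` once per round reproduces the closed form in which the weight of expert i is proportional to the exponential of minus eta times its cumulative loss, so only the current weight vector needs to be stored.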


A simple application of Theorem 2.1 reveals the following performance bound for the
exponentially weighted average forecaster.

Figure 2.3. Plots of the exponential potential function $\Phi_\eta(u)$ for N = 2 experts with $\eta = 0.5$ and
$\eta = 2$.


Corollary 2.2. Assume that the loss function $\ell$ is convex in its first argument and that it
takes values in [0, 1]. For any n and $\eta > 0$, and for all $y_1, \ldots, y_n \in \mathcal{Y}$, the regret of the
exponentially weighted average forecaster satisfies

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \frac{\ln N}{\eta} + \frac{n\eta}{2} .
\]

Optimizing the upper bound suggests the choice $\eta = \sqrt{2\ln N/n}$. In this case the upper
bound becomes $\sqrt{2n\ln N}$, which is slightly better than the best bound we obtained using
$\phi(x) = x_+^p$ with $p = 2\ln N$. In the next section we improve the bound of Corollary 2.2 by
a direct analysis. The disadvantage of the exponential weighting is that optimal tuning of
the parameter $\eta$ requires knowledge of the horizon n in advance. In the next two sections
we describe versions of the exponentially weighted average forecaster that do not suffer
from this drawback.


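Proof of Corollary 2.2. Apply Theorem 2.1 using the exponential potential. Then
$\phi(x) = e^{\eta x}$, $\psi(x) = (1/\eta)\ln x$, and

\[
\psi'\Bigl(\sum_{i=1}^N \phi(u_i)\Bigr)\sum_{i=1}^N \phi''(u_i)\, r_{i,t}^2 \le \eta \max_{i=1,\ldots,N} r_{i,t}^2 \le \eta .
\]

Using $\Phi_\eta(0) = (\ln N)/\eta$, Theorem 2.1 implies that

\[
\max_{i=1,\ldots,N} R_{i,n} \le \Phi_\eta(R_n) \le \frac{\ln N}{\eta} + \frac{n\eta}{2}
\]

as desired. ∎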


2.2 An Optimal Bound 


The purpose of this section is to show that, even for general convex loss functions, the 
bound of Corollary 2.2 may be improved for the exponentially weighted average forecaster. 
The following result improves Corollary 2.2 by a constant factor. In Section 3.7 we see that
the bound obtained here cannot be improved further.

Theorem 2.2. Assume that the loss function $\ell$ is convex in its first argument and that it
takes values in [0, 1]. For any n and $\eta > 0$, and for all $y_1, \ldots, y_n \in \mathcal{Y}$, the regret of the
exponentially weighted average forecaster satisfies

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \frac{\ln N}{\eta} + \frac{n\eta}{8} .
\]

In particular, with $\eta = \sqrt{8\ln N/n}$, the upper bound becomes $\sqrt{(n/2)\ln N}$.
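
The tuning of $\eta$ follows from a one-line minimization, spelled out here as a short worked derivation. The name $g$ for the bound viewed as a function of $\eta$ is introduced only for this illustration.

```latex
\begin{align*}
g(\eta) &= \frac{\ln N}{\eta} + \frac{n\eta}{8},
&
g'(\eta) &= -\frac{\ln N}{\eta^2} + \frac{n}{8} = 0
\iff \eta = \sqrt{\frac{8\ln N}{n}}, \\
g\Bigl(\sqrt{8\ln N/n}\Bigr)
&= \sqrt{\frac{n\ln N}{8}} + \sqrt{\frac{n\ln N}{8}}
 = \sqrt{\frac{n}{2}\ln N}.
\end{align*}
```

The same computation applied to $(\ln N)/\eta + n\eta/2$ gives $\eta = \sqrt{2\ln N/n}$ and the value $\sqrt{2n\ln N}$ of Corollary 2.2, so the improvement is exactly a factor of 2.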


The proof is similar, in spirit, to that of Corollary 2.2, but now, instead of bounding the
evolution of $(1/\eta)\ln\bigl(\sum_i e^{\eta R_{i,t}}\bigr)$, we bound the related quantities $(1/\eta)\ln(W_t/W_{t-1})$, where

\[
W_t = \sum_{i=1}^N w_{i,t} = \sum_{i=1}^N e^{-\eta L_{i,t}}
\]

for $t \ge 1$, and $W_0 = N$. In the proof we use the following classical inequality due to
Hoeffding [161].


Lemma 2.2. Let X be a random variable with $a \le X \le b$. Then for any $s \in \mathbb{R}$,

\[
\ln \mathbb{E}\bigl[e^{sX}\bigr] \le s\,\mathbb{E}X + \frac{s^2(b-a)^2}{8} .
\]

The proof is in Section A.1 of the Appendix.
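
As an informal illustration (not a substitute for the proof), the inequality of Lemma 2.2 can be checked numerically for a discrete random variable. The helper names `log_mgf` and `hoeffding_bound` are hypothetical, and the distribution is passed as parallel lists of values and probabilities.

```python
import math

def log_mgf(values, probs, s):
    """ln E[exp(s X)] for a discrete X taking values[i] with probability probs[i]."""
    return math.log(sum(p * math.exp(s * x) for x, p in zip(values, probs)))

def hoeffding_bound(values, probs, s, a, b):
    """Right-hand side of Lemma 2.2: s E[X] + s^2 (b - a)^2 / 8, assuming a <= X <= b."""
    mean = sum(p * x for x, p in zip(values, probs))
    return s * mean + s * s * (b - a) ** 2 / 8.0
```

In the proof below the lemma is used with $s = -\eta$ and a random variable supported on the experts' losses in [0, 1], which is why the quadratic term becomes $\eta^2/8$.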


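Proof of Theorem 2.2. First observe that

\[
\begin{aligned}
\ln\frac{W_n}{W_0} &= \ln\Bigl(\sum_{i=1}^N e^{-\eta L_{i,n}}\Bigr) - \ln N \\
&\ge \ln\Bigl(\max_{i=1,\ldots,N} e^{-\eta L_{i,n}}\Bigr) - \ln N \\
&= -\eta \min_{i=1,\ldots,N} L_{i,n} - \ln N. \qquad (2.1)
\end{aligned}
\]

On the other hand, for each $t = 1, \ldots, n$,

\[
\ln\frac{W_t}{W_{t-1}}
= \ln\frac{\sum_{i=1}^N e^{-\eta \ell(f_{i,t},\, y_t)}\, e^{-\eta L_{i,t-1}}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}}
= \ln\frac{\sum_{i=1}^N w_{i,t-1}\, e^{-\eta \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^N w_{j,t-1}} .
\]

Now using Lemma 2.2, we observe that the quantity above may be upper bounded by

\[
\begin{aligned}
-\eta\,\frac{\sum_{i=1}^N w_{i,t-1}\,\ell(f_{i,t}, y_t)}{\sum_{j=1}^N w_{j,t-1}} + \frac{\eta^2}{8}
&\le -\eta\,\ell\Bigl(\frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{\sum_{j=1}^N w_{j,t-1}},\, y_t\Bigr) + \frac{\eta^2}{8} \\
&= -\eta\,\ell(\hat{p}_t, y_t) + \frac{\eta^2}{8},
\end{aligned}
\]

where we used the convexity of the loss function in its first argument and the definition of
the exponentially weighted average forecaster. Summing over $t = 1, \ldots, n$, we get

\[
\ln\frac{W_n}{W_0} \le -\eta \hat{L}_n + \frac{\eta^2}{8}\, n .
\]

Combining this with the lower bound (2.1) and solving for $\hat{L}_n$, we find that

\[
\hat{L}_n \le \min_{i=1,\ldots,N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta}{8}\, n
\]

as desired. ∎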


2.3 Bounds That Hold Uniformly over Time 


As we pointed out in the previous section, the exponentially weighted average forecaster
has the disadvantage that the regret bound of Corollary 2.2 does not hold uniformly over
sequences of any length, but only for sequences of a given length n, where n is the value
used to choose the parameter $\eta$. To fix this problem one can use the so-called “doubling
trick.” The idea is to partition time into periods of exponentially increasing lengths. In each
period, the weighted average forecaster is used with a parameter $\eta$ chosen optimally for
the length of the interval. When the period ends, the weighted average forecaster is reset
and then is started again in the next period with a new value for $\eta$. If the doubling trick is
used with the exponentially weighted average forecaster, then it achieves, for any sequence
$y_1, y_2, \ldots \in \mathcal{Y}$ of outcomes and for any $n \ge 1$,

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \frac{\sqrt{2}}{\sqrt{2}-1}\sqrt{\frac{n}{2}\ln N}
\]

(see Exercise 2.8). This bound is worse than that of Theorem 2.2 by a factor of $\sqrt{2}/(\sqrt{2}-1)$,
which is about 3.41.
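
The doubling trick can be sketched as follows. This is only one possible reading of the description above: it assumes periods of lengths 1, 2, 4, ..., tunes $\eta$ within each period as in Theorem 2.2 for that period length, and works with scalar advice; the names `doubling_trick_ewa` and `doubling_trick_bound` are hypothetical.

```python
import math

def doubling_trick_ewa(advice_seq, outcomes, loss):
    """EWA forecaster restarted on periods of length 1, 2, 4, ...

    Returns (cumulative loss of the forecaster, cumulative expert losses over all rounds).
    """
    n_experts = len(advice_seq[0])
    n = len(outcomes)
    forecaster_loss = 0.0
    expert_losses = [0.0] * n_experts
    t = 0        # rounds played so far
    period = 0   # index m of the current period, of length 2**m
    while t < n:
        length = 2 ** period
        # eta tuned for a horizon equal to the length of the current period
        eta = math.sqrt(8.0 * math.log(n_experts) / length)
        in_period = [0.0] * n_experts   # reset: losses counted within the period only
        for _ in range(min(length, n - t)):
            advice, y = advice_seq[t], outcomes[t]
            weights = [math.exp(-eta * c) for c in in_period]
            total = sum(weights)
            prediction = sum(w * f for w, f in zip(weights, advice)) / total
            forecaster_loss += loss(prediction, y)
            for i, f in enumerate(advice):
                expert_loss = loss(f, y)
                in_period[i] += expert_loss
                expert_losses[i] += expert_loss
            t += 1
        period += 1
    return forecaster_loss, expert_losses

def doubling_trick_bound(n, n_experts):
    """sqrt(2) / (sqrt(2) - 1) * sqrt((n / 2) ln N), the bound quoted in the text."""
    return math.sqrt(2.0) / (math.sqrt(2.0) - 1.0) * math.sqrt(n / 2.0 * math.log(n_experts))
```

Note that the horizon n is never used to choose a parameter: it only determines when the loop stops.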

Considering that the doubling trick resets the weights of the underlying forecaster
after each period, one may wonder whether a better bound could be obtained by a more
direct argument. In fact, we can avoid the doubling trick altogether by using the weighted
average forecaster with a time-varying potential. That is, we let the parameter $\eta$ of the
exponential potential depend on the round number t. As the best nonuniform bounds for
the exponential potential are obtained by choosing $\eta = \sqrt{8(\ln N)/n}$, a natural choice for
a time-varying exponential potential is thus $\eta_t = \sqrt{8(\ln N)/t}$. By adapting the approach
used to prove Theorem 2.2, we obtain for this choice of $\eta_t$ a regret bound whose main term
is $2\sqrt{(n/2)\ln N}$ and is therefore better than the doubling trick bound. More precisely, we
prove the following result.


Theorem 2.3. Assume that the loss function $\ell$ is convex in its first argument and takes
values in [0, 1]. For all $n \ge 1$ and for all $y_1, \ldots, y_n \in \mathcal{Y}$, the regret of the exponentially
weighted average forecaster with time-varying parameter $\eta_t = \sqrt{8(\ln N)/t}$ satisfies

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le 2\sqrt{\frac{n}{2}\ln N} + \sqrt{\frac{\ln N}{8}} .
\]

The exponentially weighted average forecaster with time-varying potential predicts with
$\hat{p}_t = \sum_{i=1}^N f_{i,t}\, w_{i,t-1}/W_{t-1}$, where $W_{t-1} = \sum_{j=1}^N w_{j,t-1}$ and $w_{i,t-1} = e^{-\eta_t L_{i,t-1}}$. The potential
parameter is chosen as $\eta_t = \sqrt{a(\ln N)/t}$, where $a > 0$ is determined by the analysis.
We use $w'_{i,t-1} = e^{-\eta_{t-1} L_{i,t-1}}$ to denote the weight $w_{i,t-1}$, where the parameter $\eta_t$ is
replaced by $\eta_{t-1}$. Finally, we use $k_t$ to denote the expert whose loss after the first t
rounds is the lowest (ties are broken by choosing the expert with smallest index). That is,
$L_{k_t,t} = \min_{i \le N} L_{i,t}$. In the proof of the theorem, we also make use of the following technical
lemma.


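Lemma 2.3. For all $N \ge 2$, for all $\beta \ge \alpha > 0$, and for all $d_1, \ldots, d_N \ge 0$ such that
$\sum_{i=1}^N e^{-\alpha d_i} \ge 1$,

\[
\ln\frac{\sum_{i=1}^N e^{-\alpha d_i}}{\sum_{j=1}^N e^{-\beta d_j}} \le \frac{\beta-\alpha}{\alpha}\ln N .
\]

Proof. We begin by writing

\[
\ln\frac{\sum_{i=1}^N e^{-\alpha d_i}}{\sum_{j=1}^N e^{-\beta d_j}}
= -\ln\frac{\sum_{j=1}^N e^{-(\beta-\alpha) d_j}\, e^{-\alpha d_j}}{\sum_{i=1}^N e^{-\alpha d_i}}
= -\ln \mathbb{E}\bigl[e^{-(\beta-\alpha)D}\bigr]
\le (\beta-\alpha)\,\mathbb{E}D
\]

by Jensen's inequality, where D is a random variable taking value $d_i$ with probability
$e^{-\alpha d_i}/\sum_{j=1}^N e^{-\alpha d_j}$ for each $i = 1, \ldots, N$. Because D takes at most N distinct values, its
entropy H(D) is at most $\ln N$ (see Section A.2 in the Appendix). Therefore,

\[
\begin{aligned}
\ln N \ge H(D)
&= \sum_{i=1}^N e^{-\alpha d_i}\Bigl(\alpha d_i + \ln\sum_{k=1}^N e^{-\alpha d_k}\Bigr)\frac{1}{\sum_{j=1}^N e^{-\alpha d_j}} \\
&= \alpha\,\mathbb{E}D + \ln\sum_{k=1}^N e^{-\alpha d_k} \\
&\ge \alpha\,\mathbb{E}D,
\end{aligned}
\]

where the last inequality holds because $\sum_{i=1}^N e^{-\alpha d_i} \ge 1$. Hence $\mathbb{E}D \le (\ln N)/\alpha$. As $\beta \ge \alpha$
by hypothesis, we can substitute the upper bound on $\mathbb{E}D$ in the first derivation above and
conclude the proof. ∎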


We are now ready to prove the main theorem. 


Proof of Theorem 2.3. As in the proof of Theorem 2.2, we study the evolution of
$\ln(W_t/W_{t-1})$. However, here we need to couple this with $\ln(w_{k_{t-1},t-1}/w_{k_t,t})$, including
in both terms the time-varying parameter $\eta_t$. Keeping track of the currently best expert, $k_t$
is used to lower bound the weight $\ln(w_{k_t,t}/W_t)$. In fact, the weight of the overall best expert
(after n rounds) could get arbitrarily small during the prediction process. We thus write the
following:

\[
\begin{aligned}
\frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}}{W_{t-1}} - \frac{1}{\eta_{t+1}}\ln\frac{w_{k_t,t}}{W_t}
={}& \underbrace{\Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)\ln\frac{W_t}{w_{k_t,t}}}_{(A)}
 + \underbrace{\frac{1}{\eta_t}\ln\frac{w'_{k_t,t}/W'_t}{w_{k_t,t}/W_t}}_{(B)} \\
& + \underbrace{\frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}/W_{t-1}}{w'_{k_t,t}/W'_t}}_{(C)} .
\end{aligned}
\]

We now bound separately the three terms on the right-hand side. The term (A) is easily
bounded by noting that $\eta_{t+1} < \eta_t$ and using the fact that $k_t$ is the index of the expert with
the smallest loss after the first t rounds. Therefore, $w_{k_t,t}/W_t$ must be at least 1/N. Thus we
have

\[
(A) = \Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)\ln\frac{W_t}{w_{k_t,t}}
\le \Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)\ln N .
\]

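We proceed to bounding the term (B) as follows:

\[
\begin{aligned}
(B) = \frac{1}{\eta_t}\ln\frac{w'_{k_t,t}/W'_t}{w_{k_t,t}/W_t}
&= \frac{1}{\eta_t}\ln\frac{\sum_{i=1}^N e^{-\eta_{t+1}(L_{i,t} - L_{k_t,t})}}{\sum_{j=1}^N e^{-\eta_t(L_{j,t} - L_{k_t,t})}} \\
&\le \frac{\eta_t - \eta_{t+1}}{\eta_t\,\eta_{t+1}}\ln N
 = \Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)\ln N,
\end{aligned}
\]

where the inequality is proven by applying Lemma 2.3 with $d_i = L_{i,t} - L_{k_t,t}$. Note that
$d_i \ge 0$ because $k_t$ is the index of the expert with the smallest loss after the first t rounds
and $\sum_{i=1}^N e^{-\eta_{t+1} d_i} \ge 1$ as for $i = k_t$ we have $d_i = 0$. The term (C) is first split as follows:

\[
(C) = \frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}/W_{t-1}}{w'_{k_t,t}/W'_t}
= \frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}}{w'_{k_t,t}} + \frac{1}{\eta_t}\ln\frac{W'_t}{W_{t-1}} .
\]

We treat separately each one of the two subterms on the right-hand side. For the first one,
we have

\[
\frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}}{w'_{k_t,t}}
= \frac{1}{\eta_t}\ln\frac{e^{-\eta_t L_{k_{t-1},t-1}}}{e^{-\eta_t L_{k_t,t}}}
= L_{k_t,t} - L_{k_{t-1},t-1} .
\]

For the second subterm, we proceed similarly to the proof of Theorem 2.2 by applying
Hoeffding's bound (Lemma 2.2) to the random variable Z that takes the value $\ell(f_{i,t}, y_t)$
with probability $w_{i,t-1}/W_{t-1}$ for each $i = 1, \ldots, N$:

\[
\begin{aligned}
\frac{1}{\eta_t}\ln\frac{W'_t}{W_{t-1}}
&= \frac{1}{\eta_t}\ln\sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}}\, e^{-\eta_t \ell(f_{i,t},\, y_t)} \\
&\le -\sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}}\,\ell(f_{i,t}, y_t) + \frac{\eta_t}{8} \\
&\le -\ell(\hat{p}_t, y_t) + \frac{\eta_t}{8},
\end{aligned}
\]
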

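where in the last step we used the convexity of the loss $\ell$. Finally, we substitute back in the
main equation the bounds on the first two terms (A) and (B), and the bounds on the two
subterms of the term (C). Solving for $\ell(\hat{p}_t, y_t)$, we obtain

\[
\begin{aligned}
\ell(\hat{p}_t, y_t) \le{}& \bigl(L_{k_t,t} - L_{k_{t-1},t-1}\bigr) + \frac{\sqrt{a\ln N}}{8}\,\frac{1}{\sqrt{t}} \\
& + \frac{1}{\eta_{t+1}}\ln\frac{w_{k_t,t}}{W_t} - \frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}}{W_{t-1}} \\
& + 2\Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)\ln N .
\end{aligned}
\]

We apply the above inequality to each $t = 1, \ldots, n$ and sum up using
$\sum_{t=1}^n \ell(\hat{p}_t, y_t) = \hat{L}_n$, $\sum_{t=1}^n (L_{k_t,t} - L_{k_{t-1},t-1}) = \min_{i=1,\ldots,N} L_{i,n}$, $\sum_{t=1}^n 1/\sqrt{t} \le 2\sqrt{n}$, and

\[
\begin{aligned}
\sum_{t=1}^n \Bigl(\frac{1}{\eta_{t+1}}\ln\frac{w_{k_t,t}}{W_t} - \frac{1}{\eta_t}\ln\frac{w_{k_{t-1},t-1}}{W_{t-1}}\Bigr)
&\le -\frac{1}{\eta_1}\ln\frac{w_{k_0,0}}{W_0} = \sqrt{\frac{\ln N}{a}}, \\
\sum_{t=1}^n \Bigl(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Bigr)
&= \sqrt{\frac{n+1}{a\ln N}} - \sqrt{\frac{1}{a\ln N}} .
\end{aligned}
\]

Thus, we can write the bound

\[
\hat{L}_n \le \min_{i=1,\ldots,N} L_{i,n} + \frac{\sqrt{a\,n\ln N}}{4} + 2\sqrt{\frac{(n+1)\ln N}{a}} - \sqrt{\frac{\ln N}{a}} .
\]

Finally, by overapproximating and choosing a = 8 to trade off the two main terms, we get

\[
\hat{L}_n \le \min_{i=1,\ldots,N} L_{i,n} + 2\sqrt{\frac{n}{2}\ln N} + \sqrt{\frac{\ln N}{8}}
\]

as desired. ∎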
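
The forecaster analyzed in Theorem 2.3 can be sketched as follows, assuming scalar advice and a loss with values in [0, 1]. The names `ewa_time_varying` and `theorem_2_3_bound` are hypothetical, and the subtraction of the smallest cumulative loss is a numerical-stability device added here; it does not change the normalized weights.

```python
import math

def ewa_time_varying(advice_seq, outcomes, loss):
    """EWA forecaster with eta_t = sqrt(8 ln N / t).

    Returns (cumulative loss of the forecaster, cumulative expert losses).
    """
    n_experts = len(advice_seq[0])
    cumulative = [0.0] * n_experts   # L_{i,t-1}
    forecaster_loss = 0.0
    for t, (advice, y) in enumerate(zip(advice_seq, outcomes), start=1):
        eta_t = math.sqrt(8.0 * math.log(n_experts) / t)
        # Weights exp(-eta_t * L_{i,t-1}) are recomputed from the cumulative
        # losses each round, because eta_t changes with t.
        best = min(cumulative)
        weights = [math.exp(-eta_t * (c - best)) for c in cumulative]
        total = sum(weights)
        prediction = sum(w * f for w, f in zip(weights, advice)) / total
        forecaster_loss += loss(prediction, y)
        for i, f in enumerate(advice):
            cumulative[i] += loss(f, y)
    return forecaster_loss, cumulative

def theorem_2_3_bound(n, n_experts):
    """2 sqrt((n / 2) ln N) + sqrt(ln N / 8)."""
    ln_n = math.log(n_experts)
    return 2.0 * math.sqrt(n / 2.0 * ln_n) + math.sqrt(ln_n / 8.0)
```

Unlike the fixed-parameter forecaster, the weights here cannot be updated multiplicatively from the previous round, since changing $\eta_t$ rescales every exponent; storing the cumulative losses is enough.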


2.4 An Improvement for Small Losses 


The regret bound for the exponentially weighted average forecaster shown in Theorem 2.2 
may be improved significantly whenever it is known beforehand that the cumulative loss 
of the best expert will be small. In some cases, as we will see, this improvement may even 
be achieved without any prior knowledge. 

To understand why one can hope for better bounds for the regret when the cumulative
loss of the best expert is small, recall the simple example described in the introduction when
$\mathcal{Y} = \mathcal{D} = \{0, 1\}$ and $\ell(\hat{p}, y) = |\hat{p} - y| \in \{0, 1\}$ (this example violates our assumption that
$\mathcal{D}$ is a convex set but helps understand the basic phenomenon). If the forecaster knows in
advance that one of the N experts will suffer zero loss, that is, $\min_{i=1,\ldots,N} L_{i,n} = 0$, then
he may predict using the following simple “majority vote.” At time t = 1 predict $\hat{p}_1 = 0$ if
and only if at least half of the N experts predict 0. After the first bit $y_1$ is revealed, discard
all experts with $f_{i,1} \ne y_1$. At time t = 2 predict $\hat{p}_2 = 0$ if and only if at least half of the
remaining experts predict 0, discard all incorrectly predicting experts after $y_2$ is revealed,
and so on. Hence, each time the forecaster makes a mistake, at least half of the surviving
experts are discarded (because the forecaster always votes according to the majority of the
remaining experts). If only one expert remains, the forecaster does not make any further
mistake. Thus, the total number of mistakes of the forecaster (which, in this case, coincides
with his regret) is at most $\lfloor \log_2 N \rfloor$.
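
The majority vote just described can be sketched as below. The sketch assumes binary advice and outcomes and, as in the text, that at least one expert never errs (otherwise the set of surviving experts may become empty). The names `majority_vote_mistakes` and `mistake_bound` are hypothetical.

```python
import math

def majority_vote_mistakes(advice_seq, outcomes):
    """Count the mistakes of the majority vote over the experts that never erred so far.

    advice_seq[t][i] is the bit predicted by expert i at round t.
    """
    alive = set(range(len(advice_seq[0])))
    mistakes = 0
    for advice, y in zip(advice_seq, outcomes):
        zeros = sum(1 for i in alive if advice[i] == 0)
        # Predict 0 if and only if at least half of the surviving experts predict 0.
        prediction = 0 if 2 * zeros >= len(alive) else 1
        if prediction != y:
            mistakes += 1
        # Discard every expert that predicted incorrectly in this round.
        alive = {i for i in alive if advice[i] == y}
    return mistakes

def mistake_bound(n_experts):
    """floor(log2 N), the largest possible number of mistakes."""
    return math.floor(math.log2(n_experts))
```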

In this section we show that regret bounds of the form $O(\ln N)$ are possible for all
bounded and convex losses on convex decision spaces whenever the forecaster is given the
information that the best expert will suffer no loss. Note that such bounds are significantly
better than $\sqrt{(n/2)\ln N}$, which holds independently of the loss of the best expert.

For simplicity, we write $L_n^* = \min_{i=1,\ldots,N} L_{i,n}$. We now show that whenever $L_n^*$ is known
beforehand, the parameter $\eta$ of the exponentially weighted average forecaster can be chosen
so that his regret is bounded by $\sqrt{2L_n^*\ln N} + \ln N$, which equals $\ln N$ when $L_n^* = 0$ and
is of order $\sqrt{n\ln N}$ when $L_n^*$ is of order n. Our main tool is the following refinement of
Theorem 2.2.


Theorem 2.4. Assume that the loss function $\ell$ is convex in its first argument and that it
takes values in [0, 1]. Then for any $\eta > 0$ the regret of the exponentially weighted average
forecaster satisfies

\[
\hat{L}_n \le \frac{\eta L_n^* + \ln N}{1 - e^{-\eta}} .
\]

It is easy to see that, in some cases, an uninformed choice of $\eta$ can still lead to a good regret
bound.


Corollary 2.3. Assume that the exponentially weighted average forecaster is used with
$\eta = 1$. Then, under the conditions of Theorem 2.4,

\[
\hat{L}_n \le \frac{e}{e-1}\bigl(L_n^* + \ln N\bigr) .
\]

This bound is much better than the general bound of Theorem 2.2 if $L_n^* \ll \sqrt{n}$, but it may
be much worse otherwise.

We now derive a new bound by tuning $\eta$ in Theorem 2.4 in terms of the total loss $L_n^*$ of
the best expert.


Corollary 2.4. Assume the exponentially weighted average forecaster is used with
$\eta = \ln\bigl(1 + \sqrt{(2\ln N)/L_n^*}\bigr)$, where $L_n^* > 0$ is supposed to be known in advance. Then, under the
conditions of Theorem 2.4,

\[
\hat{L}_n - L_n^* \le \sqrt{2L_n^*\ln N} + \ln N .
\]
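
The tuning of Corollary 2.4 can be evaluated numerically against the bound of Theorem 2.4. The helper names below (`theorem_2_4_bound`, `corollary_2_4_eta`, `corollary_2_4_bound`) are hypothetical and serve only as an illustration; `best_loss` stands for $L_n^*$, which is assumed known and positive.

```python
import math

def theorem_2_4_bound(eta, best_loss, n_experts):
    """(eta L* + ln N) / (1 - exp(-eta)), the bound of Theorem 2.4 on the forecaster's loss."""
    return (eta * best_loss + math.log(n_experts)) / (1.0 - math.exp(-eta))

def corollary_2_4_eta(best_loss, n_experts):
    """eta = ln(1 + sqrt(2 ln N / L*)); requires best_loss > 0."""
    return math.log(1.0 + math.sqrt(2.0 * math.log(n_experts) / best_loss))

def corollary_2_4_bound(best_loss, n_experts):
    """L* + sqrt(2 L* ln N) + ln N, the bound of Corollary 2.4 on the forecaster's loss."""
    ln_n = math.log(n_experts)
    return best_loss + math.sqrt(2.0 * best_loss * ln_n) + ln_n
```

Evaluating `theorem_2_4_bound` at $\eta = 1$ gives the bound of Corollary 2.3, while evaluating it at `corollary_2_4_eta(best_loss, n_experts)` gives a value no larger than `corollary_2_4_bound(best_loss, n_experts)`.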


Proof. Using Theorem 2.4, we just need to show that, for our choice of $\eta$,

\[
\frac{\ln N + \eta L_n^*}{1 - e^{-\eta}} \le L_n^* + \ln N + \sqrt{2L_n^*\ln N}. \qquad (2.2)
\]

We start from the elementary inequality $(e^{\eta} - e^{-\eta})/2 = \sinh(\eta) \ge \eta$, which holds for all
$\eta \ge 0$. Replacing the $\eta$ in the numerator of the left-hand side of (2.2) with this upper bound,
we find that (2.2) is implied by

\[
\frac{\ln N}{1 - e^{-\eta}} + \frac{1 + e^{-\eta}}{2e^{-\eta}}\, L_n^* \le L_n^* + \ln N + \sqrt{2L_n^*\ln N} .
\]

The proof is concluded by noting that the above inequality holds with equality for our
choice of $\eta$. ∎

Of course, the quantity $L_n^*$ is only available after the nth round of prediction. The lack
of this information may be compensated by letting $\eta$ change according to the loss of the
currently best expert, similarly to the way shown in Section 2.3 (see Exercise 2.10). The
regret bound that is obtainable via this approach is of the form $2\sqrt{2L_n^*\ln N} + c\ln N$, where
$c > 1$ is a constant. Note that, similarly to Theorem 2.3, the use of a time-varying parameter
$\eta_t$ leads to a bound whose leading term is twice the one obtained when $\eta$ is fixed and chosen
optimally on the basis of either the horizon n (as in Theorem 2.2) or the loss $L_n^*$ of the best
expert (as in Corollary 2.4).


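Proof of Theorem 2.4. The proof is a simple modification of that of Theorem 2.2. The only
difference is that Hoeffding's inequality is now replaced by Lemma A.3 (see the Appendix).
Recall from the proof of Theorem 2.2 that

\[
-\eta L_n^* - \ln N \le \ln\frac{W_n}{W_0} = \sum_{t=1}^n \ln\frac{W_t}{W_{t-1}}
= \sum_{t=1}^n \ln\frac{\sum_{i=1}^N w_{i,t-1}\, e^{-\eta \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^N w_{j,t-1}} .
\]

We apply Lemma A.3 to the random variable $X_t$ that takes value $\ell(f_{i,t}, y_t)$ with probability
$w_{i,t-1}/W_{t-1}$ for each $i = 1, \ldots, N$. Note that by convexity of the loss function and Jensen's
inequality, $\mathbb{E}X_t \ge \ell(\hat{p}_t, y_t)$ and therefore, by Lemma A.3,

\[
\ln\frac{\sum_{i=1}^N w_{i,t-1}\, e^{-\eta \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^N w_{j,t-1}}
= \ln \mathbb{E}\bigl[e^{-\eta X_t}\bigr]
\le \bigl(e^{-\eta} - 1\bigr)\,\mathbb{E}X_t
\le \bigl(e^{-\eta} - 1\bigr)\,\ell(\hat{p}_t, y_t) .
\]

Thus,

\[
-\eta L_n^* - \ln N \le \sum_{t=1}^n \ln\frac{\sum_{i=1}^N w_{i,t-1}\, e^{-\eta \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^N w_{j,t-1}}
\le \bigl(e^{-\eta} - 1\bigr)\hat{L}_n .
\]

Solving for $\hat{L}_n$ yields the result. ∎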


2.5 Forecasters Using the Gradient of the Loss 


Consider again the exponentially weighted average forecaster whose predictions are defined
by

\[
\hat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{\sum_{j=1}^N w_{j,t-1}},
\]

where the weight $w_{i,t-1}$ for expert i at round t is defined by $w_{i,t-1} = e^{-\eta L_{i,t-1}}$. In this section
we introduce and analyze a different exponentially weighted average forecaster in which
the cumulative loss $L_{i,t-1}$ appearing at the exponent of $w_{i,t-1}$ is replaced by the gradient
of the loss summed up to time t − 1. This new class of forecasters will be generalized in
Chapter 11, where we also provide extensive analysis and motivation.

Recall that the decision space $\mathcal{D}$ is a convex subset of a linear space. Throughout this
section, we also assume that $\mathcal{D}$ is finite dimensional, though this assumption can be relaxed
easily. If $\ell$ is differentiable, we use $\nabla\ell(\hat{p}, y)$ to denote its gradient with respect to the first
argument $\hat{p} \in \mathcal{D}$.

Define the gradient-based exponentially weighted average forecaster by

\[
\hat{p}_t = \frac{\sum_{i=1}^N \exp\Bigl(-\eta\sum_{s=1}^{t-1} \nabla\ell(\hat{p}_s, y_s)\cdot f_{i,s}\Bigr) f_{i,t}}
                 {\sum_{j=1}^N \exp\Bigl(-\eta\sum_{s=1}^{t-1} \nabla\ell(\hat{p}_s, y_s)\cdot f_{j,s}\Bigr)} .
\]

To understand the intuition behind this predictor, note that the weight assigned to expert i is
small if the sum of the inner products $\nabla\ell(\hat{p}_s, y_s)\cdot f_{i,s}$ has been large in the past. This inner
product is large if expert i's advice $f_{i,s}$ points in the direction of the largest increase of the
loss function. Such a large value means that having assigned a slightly larger weight to this
expert would have increased the loss suffered at time s. According to the philosophy of the
gradient-based exponentially weighted average forecaster, the weight of such an expert has
to be decreased.

The predictions of this forecaster are, of course, generally different from those of the
standard exponentially weighted average forecaster. However, note that in the special case of
binary prediction with absolute loss (i.e., if $\mathcal{D} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$ and $\ell(x, y) = |x - y|$),
a setup that we study in detail in Chapter 8, the predictions of the two forecasters are
identical (see the exercises).
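
The coincidence of the two forecasters under the absolute loss can be observed numerically. The sketch below assumes scalar decisions in [0, 1] and binary outcomes, so that gradients are plain derivatives; all function names are hypothetical. For binary y the derivative of |q − y| on [0, 1] is +1 when y = 0 and −1 when y = 1 (taken one-sided at the endpoints), which is what `abs_loss_grad` returns.

```python
import math

def _average(scores, advice, eta):
    """Average of the advice with weights proportional to exp(-eta * score)."""
    shift = min(scores)  # a common shift leaves the normalized weights unchanged
    weights = [math.exp(-eta * (s - shift)) for s in scores]
    return sum(w * f for w, f in zip(weights, advice)) / sum(weights)

def gradient_ewa(advice_seq, outcomes, eta, grad_loss):
    """Gradient-based forecaster: the score of expert i accumulates grad * f_{i,s}."""
    scores = [0.0] * len(advice_seq[0])
    predictions = []
    for advice, y in zip(advice_seq, outcomes):
        prediction = _average(scores, advice, eta)
        predictions.append(prediction)
        g = grad_loss(prediction, y)
        scores = [s + g * f for s, f in zip(scores, advice)]
    return predictions

def standard_ewa(advice_seq, outcomes, eta, loss):
    """Standard forecaster: the score of expert i is its cumulative loss."""
    scores = [0.0] * len(advice_seq[0])
    predictions = []
    for advice, y in zip(advice_seq, outcomes):
        predictions.append(_average(scores, advice, eta))
        scores = [s + loss(f, y) for s, f in zip(scores, advice)]
    return predictions

def abs_loss(q, y):
    return abs(q - y)

def abs_loss_grad(q, y):
    # Derivative of |q - y| with respect to q on [0, 1], for y in {0, 1}.
    return 1.0 if y == 0 else -1.0
```

The two score vectors differ only by an amount that is the same for every expert (0 in rounds with y = 0 and 1 in rounds with y = 1), which is why the normalized weights agree.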

We now show that, under suitable conditions on the norm of the gradient of the loss, the 
regret of the new forecaster can be bounded by the same quantity that was used to bound 
the regret of the standard exponentially weighted average forecaster in Corollary 2.2. 


Corollary 2.5. Assume that the decision space $\mathcal{D}$ is a convex subset of the euclidean unit ball
$\{q \in \mathbb{R}^d : \|q\| \le 1\}$, the loss function $\ell$ is convex in its first argument and that its gradient
$\nabla\ell$ exists and satisfies $\|\nabla\ell\| \le 1$. For any n and $\eta > 0$, and for all $y_1, \ldots, y_n \in \mathcal{Y}$, the
regret of the gradient-based exponentially weighted average forecaster satisfies

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \frac{\ln N}{\eta} + \frac{n\eta}{2} .
\]

Proof. The weight vector $w_{t-1} = (w_{1,t-1}, \ldots, w_{N,t-1})$ used by this forecaster has components

\[
w_{i,t-1} = \exp\Bigl(-\eta\sum_{s=1}^{t-1} \nabla\ell(\hat{p}_s, y_s)\cdot f_{i,s}\Bigr) .
\]

Observe that these weights correspond to the exponentially weighted average forecaster
based on the loss function $\ell'$, defined at time t by

\[
\ell'(q, y_t) = q \cdot \nabla\ell(\hat{p}_t, y_t), \qquad q \in \mathcal{D} .
\]

By assumption, $\ell'$ takes values in [−1, 1]. Applying Theorem 2.2 after rescaling $\ell'$ in [0, 1]
(see Section 2.6), we get

\[
\max_{i=1,\ldots,N}\sum_{t=1}^n (\hat{p}_t - f_{i,t})\cdot\nabla\ell(\hat{p}_t, y_t)
= \sum_{t=1}^n \ell'(\hat{p}_t, y_t) - \min_{i=1,\ldots,N}\sum_{t=1}^n \ell'(f_{i,t}, y_t)
\le \frac{\ln N}{\eta} + \frac{n\eta}{2} .
\]

The proof is completed by expanding $\ell(f_{i,t}, y_t)$ around $\ell(\hat{p}_t, y_t)$ as follows:

\[
\ell(\hat{p}_t, y_t) - \ell(f_{i,t}, y_t) \le (\hat{p}_t - f_{i,t})\cdot\nabla\ell(\hat{p}_t, y_t),
\]

which implies that

\[
\hat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \sum_{t=1}^n \ell'(\hat{p}_t, y_t) - \min_{i=1,\ldots,N}\sum_{t=1}^n \ell'(f_{i,t}, y_t) . \qquad ∎
\]

2.6 Scaled Losses and Signed Games 


Up to this point we have always assumed that the range of the loss function $\ell$ is the
unit interval [0, 1]. We now investigate how scalings and translations of this range affect
forecasting strategies and their performance.

Consider first the case of a loss function $\ell \in [0, M]$. If M is known, we can run the
weighted average forecaster on the scaled losses $\ell/M$ and apply without any modification
the analysis developed for the [0, 1] case. For instance, in the case of the exponentially
weighted average forecaster, Corollary 2.4, applied to these scaled losses, yields the regret
bound

\[
\hat{L}_n - L_n^* \le \sqrt{2L_n^* M\ln N} + M\ln N .
\]

The additive term $M\ln N$ is necessary. Indeed, if $\ell$ is such that for all $p \in \mathcal{D}$ there exist
$p' \in \mathcal{D}$ and $y \in \mathcal{Y}$ such that $\ell(p, y) = M$ and $\ell(p', y) = 0$, then the expert advice can be
chosen so that any forecaster incurs a cumulative loss of at least $M\log N$ on some outcome
sequence with $L_n^* = 0$.

Consider now the translated range [−M, 0]. If we interpret negative losses as gains, we
may introduce the regret $G_n^* - \hat{G}_n$ measuring the difference between the cumulative gain
$G_n^* = -L_n^* = \max_{i=1,\ldots,N}(-L_{i,n})$ of the best expert and the cumulative gain $\hat{G}_n = -\hat{L}_n$ of
the forecaster. As before, if M is known we can run the weighted average forecaster on
the scaled gains $(-\ell)/M$ and apply the analysis developed for [0, 1]-valued loss functions.
Adapting Corollary 2.4 we get a bound of the form

\[
G_n^* - \hat{G}_n \le \sqrt{2G_n^* M\ln N} + M\ln N .
\]

Note that the regret now scales with the largest cumulative gain $G_n^*$.

We now turn to the general case in which the forecasters are scored using a generic
payoff function $h : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$, concave in its first argument. The goal of the forecaster is
to maximize his cumulative payoff. The corresponding regret is defined by

\[
\max_{i=1,\ldots,N}\sum_{t=1}^n h(f_{i,t}, y_t) - \sum_{t=1}^n h(\hat{p}_t, y_t) = H_n^* - \hat{H}_n .
\]

If the payoff function h has range [0, M], then it is a gain function and the forecaster plays
a gain game. Similarly, if h has range in [−M, 0], then it is the negative of a loss function
and the forecaster plays a loss game. Finally, if the range of h includes a neighborhood of
0, then the game played by the forecaster is a signed game.

Translated to this terminology, the arguments proposed at the beginning of this section
say that in any unsigned game (i.e., any loss game [−M, 0] or gain game [0, M]),
rescaling of the payoffs yields a regret bound of order $\sqrt{|H_n^*|\, M\ln N}$ whenever M
is known. In the case of signed games, however, scaling is not sufficient. Indeed, if
$h \in [-M, M]$, then the reduction to the [0, 1] case is obtained by the linear transformation
$h \mapsto (h + M)/(2M)$. Applying this to the analysis leading to Corollary 2.4, we get the regret
bound

\[
H_n^* - \hat{H}_n \le \sqrt{4\bigl(H_n^* + Mn\bigr)(M\ln N)} + 2M\ln N = O\bigl(M\sqrt{n\ln N}\bigr) .
\]

This shows that, for signed games, reducing to the [0, 1] case might not be the best thing
to do. Ideally, we would like to replace the factor n in the leading term with something like
$|h(f_{i,1}, y_1)| + \cdots + |h(f_{i,n}, y_n)|$ for an arbitrary expert i. In the next sections we show that,
in certain cases, we can do even better than that.
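
For completeness, the signed-game bound obtained through the reduction to the [0, 1] case can be traced step by step. In the following sketch the primed symbols denote quantities computed on the transformed payoffs $h' = (h + M)/(2M)$; they are introduced only for this illustration.

```latex
\begin{align*}
H'^{*}_n &= \frac{H_n^* + Mn}{2M}, \qquad
\hat{H}'_n = \frac{\hat{H}_n + Mn}{2M}
&& \text{(cumulative transformed payoffs)} \\
H'^{*}_n - \hat{H}'_n &\le \sqrt{2 H'^{*}_n \ln N} + \ln N
&& \text{(gain-game form of Corollary 2.4, range $[0,1]$)} \\
H_n^* - \hat{H}_n &= 2M\bigl(H'^{*}_n - \hat{H}'_n\bigr)
\le 2M\sqrt{\frac{(H_n^* + Mn)\ln N}{M}} + 2M\ln N \\
&= \sqrt{4\bigl(H_n^* + Mn\bigr)(M\ln N)} + 2M\ln N .
\end{align*}
```

Since $H_n^* \le Mn$, the leading term is at most $\sqrt{8M^2 n\ln N}$, which is where the factor n enters regardless of how small the payoffs actually are.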


2.7 The Multilinear Forecaster 


Potential functions offer a convenient tool to derive weighted average forecasters. However, 
good forecasters for signed games can also be designed without using potentials, as shown 
in this section. 

Fix a signed game with payoff function $h : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$, and consider the weighted
average forecaster that predicts, at time t,

\[
\hat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{W_{t-1}},
\]

where $W_{t-1} = \sum_{j=1}^N w_{j,t-1}$. The weights $w_{i,t-1}$ of this forecaster are recursively defined
as follows:

\[
w_{i,t} =
\begin{cases}
1 & \text{if } t = 0, \\
w_{i,t-1}\bigl(1 + \eta\, h(f_{i,t}, y_t)\bigr) & \text{otherwise,}
\end{cases}
\]

where $\eta > 0$ is a parameter of the forecaster. Because $w_{i,t}$ is a multilinear form of the
payoffs, we call this the multilinear forecaster.
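
A minimal sketch of the multilinear forecaster follows. It assumes scalar advice and a parameter $\eta$ small enough that every factor $1 + \eta h$ stays positive; the function name `multilinear_forecaster` is hypothetical. Besides the cumulative payoffs it also returns, for each expert, the sum of squared payoffs, a quantity that is convenient when evaluating second-order bounds.

```python
def multilinear_forecaster(advice_seq, outcomes, eta, payoff):
    """Run the multilinear forecaster.

    Returns (forecaster payoff, list of expert payoffs, list of sums of squared expert payoffs).
    """
    n_experts = len(advice_seq[0])
    weights = [1.0] * n_experts             # w_{i,0} = 1
    forecaster_payoff = 0.0
    expert_payoffs = [0.0] * n_experts      # H_{i,t}
    expert_squares = [0.0] * n_experts      # sum over rounds of h(f_{i,s}, y_s)^2
    for advice, y in zip(advice_seq, outcomes):
        total = sum(weights)                # W_{t-1}
        prediction = sum(w * f for w, f in zip(weights, advice)) / total
        forecaster_payoff += payoff(prediction, y)
        for i, f in enumerate(advice):
            h = payoff(f, y)
            expert_payoffs[i] += h
            expert_squares[i] += h * h
            # Multiplicative update w_{i,t} = w_{i,t-1} * (1 + eta * h)
            weights[i] *= 1.0 + eta * h
    return forecaster_payoff, expert_payoffs, expert_squares
```

In contrast with the exponential update, no exponential is evaluated: each weight is a product of affine functions of the payoffs, which is the multilinear structure the name refers to.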

Note that the weights $w_{1,t}, \ldots, w_{N,t}$ cannot be expressed as functions of the regret
$R_t$ of components $H_{i,t} - \hat{H}_t$. On the other hand, since $(1 + \eta h) \approx e^{\eta h}$, the regret of the
multilinear forecaster can be bounded via a technique similar to the one used in the proof of
Theorem 2.2 for the exponentially weighted average forecaster. We just need the following
simple lemma (the proof is left as an exercise).


Lemma 2.4. For all $z \ge -1/2$, $\ln(1 + z) \ge z - z^2$.


The next result shows that the regret of the multilinear forecaster is naturally expressed in 
terms of the squared sum of the payoffs of an arbitrary expert. 


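Theorem 2.5. Assume that the payoff function h is concave in its first argument and satisfies
$h \in [-M, \infty)$. For any n and $0 < \eta \le 1/(2M)$, and for all $y_1, \ldots, y_n \in \mathcal{Y}$, the regret of
the multilinear forecaster satisfies

\[
H_{i,n} - \hat{H}_n \le \frac{\ln N}{\eta} + \eta\sum_{t=1}^n h(f_{i,t}, y_t)^2 \qquad \text{for each } i = 1, \ldots, N .
\]

Proof. For any $i = 1, \ldots, N$, note that $h(f_{i,t}, y_t) \ge -M$ and $\eta \le 1/(2M)$ imply that
$\eta h(f_{i,t}, y_t) \ge -1/2$. Hence, we can apply Lemma 2.4 to $\eta h(f_{i,t}, y_t)$ and get

\[
\begin{aligned}
\ln\frac{W_n}{W_0} \ge \ln\frac{w_{i,n}}{W_0}
&= -\ln N + \ln\prod_{t=1}^n \bigl(1 + \eta h(f_{i,t}, y_t)\bigr) \\
&= -\ln N + \sum_{t=1}^n \ln\bigl(1 + \eta h(f_{i,t}, y_t)\bigr) \\
&\ge -\ln N + \sum_{t=1}^n \bigl(\eta h(f_{i,t}, y_t) - \eta^2 h(f_{i,t}, y_t)^2\bigr) \\
&= -\ln N + \eta H_{i,n} - \eta^2\sum_{t=1}^n h(f_{i,t}, y_t)^2 .
\end{aligned}
\]

On the other hand,

\[
\begin{aligned}
\ln\frac{W_n}{W_0} = \sum_{t=1}^n \ln\frac{W_t}{W_{t-1}}
&= \sum_{t=1}^n \ln\Bigl(\sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}}\bigl(1 + \eta h(f_{i,t}, y_t)\bigr)\Bigr) \\
&\le \eta\sum_{t=1}^n\sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}}\, h(f_{i,t}, y_t)
   \qquad \text{(since $\ln(1 + x) \le x$ for $x > -1$)} \\
&\le \eta\hat{H}_n \qquad \text{(since $h(\cdot, y)$ is concave).}
\end{aligned}
\]

Combining the upper and lower bounds of $\ln(W_n/W_0)$, and dividing by $\eta > 0$, we get the
claimed bound. ∎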

Let $Q_n^* = h(f_{k,1}, y_1)^2 + \cdots + h(f_{k,n}, y_n)^2$, where k is such that $H_{k,n} = H_n^* =
\max_{i=1,\ldots,N} H_{i,n}$. If $\eta$ is chosen using $Q_n^*$, then Theorem 2.5 directly implies the following.


Corollary 2.6. Assume the multilinear forecaster is run with

\[
\eta = \min\Bigl\{\frac{1}{2M},\ \sqrt{\frac{\ln N}{Q_n^*}}\Bigr\},
\]

where $Q_n^* > 0$ is supposed to be known in advance. Then, under the conditions of
Theorem 2.5,

\[
H_n^* - \hat{H}_n \le 2\sqrt{Q_n^*\ln N} + 4M\ln N .
\]


To appreciate this result, consider a loss game with $h \in [-M, 0]$ and let $L_n^* = -\max_i H_{i,n}$.
As $Q_n^* \le M L_n^*$, the performance guarantee of the multilinear forecaster is at most a factor
of $\sqrt{2}$ larger than that of the exponentially weighted average forecaster, whose regret in
this case has the leading term $\sqrt{2L_n^* M\ln N}$ (see Section 2.4). However, in some cases $Q_n^*$
may be significantly smaller than $M L_n^*$, so that the bound of Corollary 2.6 presents a real
improvement. In Section 2.8, we show that a more careful analysis of the exponentially
weighted average forecaster yields similar (though noncomparable) second-order regret
bounds.

It is still an open problem to obtain regret bounds of order $\sqrt{Q_n^*\ln N}$ without exploiting
some prior knowledge on the sequence $y_1, \ldots, y_n$ (see Exercise 2.14). In fact, the analysis
of adaptive tuning techniques, such as the doubling trick or the time-varying $\eta$, relies on
the monotonicity of the quantity whose evolution determines the tuning strategy. On the
other hand, the sequence $Q_1^*, Q_2^*, \ldots$ is not necessarily monotone as $Q_t^*$ and $Q_{t+1}^*$ cannot
generally be related when the experts achieving the largest cumulative payoffs at rounds t
and t + 1 are different.


2.8 The Exponential Forecaster for Signed Games 


A slight modification of our previous analysis is sufficient to show that the exponentially
weighted average forecaster is also able to achieve a small regret in signed games. As for
the multilinear forecaster of Section 2.7, the new bound is expressed in terms of sums of
quadratic terms, which here are related to the variance of the experts’ losses with respect to the
distribution induced by the forecaster’s weights. Furthermore, the use of a time-varying
potential allows us to dispense with the need for any preliminary knowledge of the best
cumulative payoff $H_n^*$.

We start by redefining, for the setup where payoff functions are used, the exponentially
weighted average forecaster with time-varying potential introduced in Section 2.3. Given a
payoff function $h$, this forecaster predicts with $\widehat{p}_t = \sum_{i=1}^N f_{i,t}\, w_{i,t-1}/W_{t-1}$, where
$W_{t-1} = \sum_{j=1}^N w_{j,t-1}$, $w_{i,t-1} = e^{\eta_t H_{i,t-1}}$, $H_{i,t-1} = h(f_{i,1}, y_1) + \cdots + h(f_{i,t-1}, y_{t-1})$, and we assume
that the sequence $\eta_1, \eta_2, \ldots$ of parameters is positive. Note that the value of $\eta_1$ is immaterial
because $H_{i,0} = 0$ for all $i$.

Let $X_1, X_2, \ldots$ be random variables such that $X_t = h(f_{i,t}, y_t)$ with probability
$w_{i,t-1}/W_{t-1}$ for all $i$ and $t$. The next result, whose proof is left as an exercise, bounds
the regret of the exponential forecaster for any nonincreasing sequence of potential parameters
in terms of the process $X_1, \ldots, X_n$. Note that this lemma does not assume any
boundedness condition on the payoff function.


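Lemma 2.5. Let $h$ be a payoff function concave in its first argument. The exponentially
weighted average forecaster, run with any nonincreasing sequence $\eta_1, \eta_2, \ldots$ of parameters
satisfies, for any $n \ge 1$ and for any sequence $y_1, \ldots, y_n$ of outcomes,

\[
H_n^* - \widehat{H}_n \le \left(\frac{2}{\eta_{n+1}} - \frac{1}{\eta_1}\right) \ln N
+ \sum_{t=1}^n \frac{1}{\eta_t} \ln \mathbb{E}\Bigl[e^{\eta_t (X_t - \mathbb{E} X_t)}\Bigr].
\]

Let

\[
V_n = \sum_{t=1}^n \mathrm{var}(X_t) = \sum_{t=1}^n \mathbb{E}\Bigl[(X_t - \mathbb{E} X_t)^2\Bigr].
\]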


Our next result shows that, with an appropriate choice of the sequence $\eta_t$, the regret of
the exponential forecaster at time $n$ is at most of order $\sqrt{V_n \ln N}$. Note, however, that the
bound is not in closed form as $V_n$ depends on the forecaster’s weights $w_{i,t}$ for all $i$ and $t$.

Theorem 2.6. Let $h$ be a $[-M, M]$-valued payoff function concave in its first argument.
Suppose the exponentially weighted average forecaster is run with

\[
\eta_t = \min\left\{ \frac{1}{2M},\ \sqrt{\frac{2(\sqrt{2} - 1)}{e - 2}}\, \sqrt{\frac{\ln N}{V_{t-1}}} \right\},
\qquad t = 2, 3, \ldots
\]

Then, for any $n \ge 1$ and for any sequence $y_1, \ldots, y_n$ of outcomes,

\[
H_n^* - \widehat{H}_n \le 4\sqrt{V_n \ln N} + 4M \ln N + (e - 2)M .
\]
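Proof. For brevity, write

\[
C = \sqrt{\frac{2(\sqrt{2} - 1)}{e - 2}} .
\]

We start by applying Lemma 2.5 (with, say, $\eta_1 = \eta_2$):

\[
H_n^* - \widehat{H}_n \le \left(\frac{2}{\eta_{n+1}} - \frac{1}{\eta_1}\right) \ln N
+ \sum_{t=1}^n \frac{1}{\eta_t} \ln \mathbb{E}\Bigl[e^{\eta_t (X_t - \mathbb{E} X_t)}\Bigr]
\]
\[
\le 2 \max\left\{ 2M \ln N,\ \frac{1}{C} \sqrt{V_n \ln N} \right\}
+ \sum_{t=1}^n \frac{1}{\eta_t} \ln \mathbb{E}\Bigl[e^{\eta_t (X_t - \mathbb{E} X_t)}\Bigr].
\]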


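Since $\eta_t \le 1/(2M)$, $\eta_t (X_t - \mathbb{E} X_t) \le 1$ and we may apply the inequality
$e^x \le 1 + x + (e - 2)x^2$ for all $x \le 1$. We thus find that

\[
H_n^* - \widehat{H}_n \le 2 \max\left\{ 2M \ln N,\ \frac{1}{C} \sqrt{V_n \ln N} \right\}
+ (e - 2) \sum_{t=1}^n \eta_t\, \mathrm{var}(X_t) .
\]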
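The passage from the logarithmic moment term to the variance term can be spelled out as follows. This is a worked version of the step just taken; besides the inequality $e^x \le 1 + x + (e-2)x^2$ quoted in the text, it only uses $\ln(1 + u) \le u$ and the fact that $X_t - \mathbb{E}X_t$ has zero mean.

```latex
\frac{1}{\eta_t} \ln \mathbb{E}\Bigl[e^{\eta_t (X_t - \mathbb{E} X_t)}\Bigr]
  \le \frac{1}{\eta_t} \ln\Bigl(1 + \eta_t\, \mathbb{E}[X_t - \mathbb{E} X_t]
        + (e - 2)\, \eta_t^2\, \mathbb{E}\bigl[(X_t - \mathbb{E} X_t)^2\bigr]\Bigr)
  = \frac{1}{\eta_t} \ln\Bigl(1 + (e - 2)\, \eta_t^2\, \mathrm{var}(X_t)\Bigr)
  \le (e - 2)\, \eta_t\, \mathrm{var}(X_t) .
```

Summing over $t = 1, \ldots, n$ gives the last term of the displayed bound.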
Now denote by $T$ the first time step $t$ when $V_t > M^2$. Using $\eta_t \le 1/(2M)$ for all $t$ and
$V_T \le 2M^2$, we get

\[
\sum_{t=1}^n \eta_t\, \mathrm{var}(X_t) \le M + \sum_{t=T+1}^n \eta_t\, \mathrm{var}(X_t) .
\]


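We bound the sum using $\eta_t \le C \sqrt{(\ln N)/V_{t-1}}$ for $t \ge 2$ (note that, for $t > T$,
$V_{t-1} \ge V_T > M^2 > 0$). This yields

\[
\sum_{t=T+1}^n \eta_t\, \mathrm{var}(X_t) \le C \sqrt{\ln N} \sum_{t=T+1}^n \frac{V_t - V_{t-1}}{\sqrt{V_{t-1}}} .
\]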


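Let $v_t = \mathrm{var}(X_t) = V_t - V_{t-1}$. Since $V_t \le V_{t-1} + M^2$ and $V_{t-1} \ge M^2$, we have

\[
\frac{v_t}{\sqrt{V_{t-1}}} = \frac{\sqrt{V_t} + \sqrt{V_{t-1}}}{\sqrt{V_{t-1}}} \Bigl(\sqrt{V_t} - \sqrt{V_{t-1}}\Bigr)
\le (\sqrt{2} + 1) \Bigl(\sqrt{V_t} - \sqrt{V_{t-1}}\Bigr)
= \frac{\sqrt{V_t} - \sqrt{V_{t-1}}}{\sqrt{2} - 1} .
\]

Therefore,

\[
\sum_{t=T+1}^n \eta_t\, \mathrm{var}(X_t) \le \frac{C \sqrt{\ln N}}{\sqrt{2} - 1} \Bigl(\sqrt{V_n} - \sqrt{V_T}\Bigr)
\le \frac{C}{\sqrt{2} - 1} \sqrt{V_n \ln N} .
\]

Substituting our choice of $C$ and performing trivial overapproximations concludes the
proof. □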
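The forecaster of Theorem 2.6 is easy to simulate. The sketch below is an illustration under stated assumptions rather than a reference implementation: it takes $\eta_t = 1/(2M)$ whenever $V_{t-1} = 0$ (the second term of the minimum is then infinite), subtracts the largest cumulative payoff before exponentiating for numerical stability, and uses the hypothetical payoff $h(p, y) = M(1 - 2|p - y|)$, which is concave in $p$ and $[-M, M]$-valued for $p, y \in [0, 1]$. All function names are made up for this example.

```python
import math
import random

# Constant C from the proof of Theorem 2.6.
C = math.sqrt(2.0 * (math.sqrt(2.0) - 1.0) / (math.e - 2.0))


def exp_forecaster_signed(advice, outcomes, payoff, M):
    """Run the exponential forecaster with time-varying eta_t.

    Returns (regret, V_n), where V_n is the sum of the variances of X_t
    under the forecaster's normalized weights.
    """
    n_experts = len(advice[0])
    log_n = math.log(n_experts)
    H = [0.0] * n_experts  # cumulative expert payoffs H_{i,t-1}
    V = 0.0                # V_{t-1}
    h_hat = 0.0            # forecaster's cumulative payoff
    for f_t, y_t in zip(advice, outcomes):
        eta = 1.0 / (2.0 * M)
        if V > 0.0:
            eta = min(eta, C * math.sqrt(log_n / V))
        top = max(H)  # shift exponents for numerical stability
        w = [math.exp(eta * (h - top)) for h in H]
        w_sum = sum(w)
        probs = [x / w_sum for x in w]
        p_t = sum(q * f for q, f in zip(probs, f_t))
        h_hat += payoff(p_t, y_t)
        gains = [payoff(f, y_t) for f in f_t]
        mean = sum(q * g for q, g in zip(probs, gains))
        V += sum(q * (g - mean) ** 2 for q, g in zip(probs, gains))
        H = [h + g for h, g in zip(H, gains)]
    return max(H) - h_hat, V


def theorem_2_6_bound(V, n_experts, M):
    """Right-hand side of Theorem 2.6."""
    log_n = math.log(n_experts)
    return 4.0 * math.sqrt(V * log_n) + 4.0 * M * log_n + (math.e - 2.0) * M


def demo(n=300, n_experts=6, M=2.0, seed=1):
    """Random instance; returns (regret, bound)."""
    rng = random.Random(seed)
    payoff = lambda p, y: M * (1.0 - 2.0 * abs(p - y))
    advice = [[rng.random() for _ in range(n_experts)] for _ in range(n)]
    outcomes = [rng.randint(0, 1) for _ in range(n)]
    regret, V = exp_forecaster_signed(advice, outcomes, payoff, M)
    return regret, theorem_2_6_bound(V, n_experts, M)
```

Note that the forecaster needs $M$ but no knowledge of the horizon or of the best cumulative payoff, in line with the discussion at the beginning of this section.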


Remark 2.2. The analyses of Theorem 2.5, Corollary 2.6, and Theorem 2.6
show that the multilinear forecaster and the exponentially weighted average forecaster
work, with no need of translating payoffs, in both unsigned and signed games. In addition,
the regret bounds shown in these results are potentially much better than the invariant bound
$M\sqrt{n \ln N}$ obtained via the explicit payoff transformation $h \mapsto (h + M)/(2M)$ from signed
to unsigned games (see Section 2.6). However, none of these bounds applies to the case
when no preliminary information is available on the sequence of observed payoffs.


The main term of the bound stated in Theorem 2.6 contains $V_n$. This quantity is smaller
than all quantities of the form

\[
\sum_{t=1}^n \sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}} \bigl(h(f_{i,t}, y_t) - \mu_t\bigr)^2 ,
\]

where $\mu_1, \mu_2, \ldots$ is any sequence of real numbers that may be chosen in hindsight, as it
is not required for the definition of the forecaster. This gives us a whole family of upper
bounds, and we may choose for the analysis the most convenient sequence of $\mu_t$.

To provide a concrete example, denote the effective range of the payoffs at time
$t$ by $R_t = \max_{i=1,\ldots,N} h(f_{i,t}, y_t) - \min_{j=1,\ldots,N} h(f_{j,t}, y_t)$ and consider the choice
$\mu_t = \min_{j=1,\ldots,N} h(f_{j,t}, y_t) + R_t/2$.

Corollary 2.7. Under the same assumptions as in Theorem 2.6,

\[
H_n^* - \widehat{H}_n \le 2\sqrt{(\ln N) \sum_{t=1}^n R_t^2} + 4M \ln N + (e - 2)M .
\]

In a loss game, where $h$ has the range $[-M, 0]$, Corollary 2.7 states that the regret is
bounded by a quantity dominated by the term $2M\sqrt{n \ln N}$. A comparison with the bound
of Theorem 2.3 shows that we have only lost a factor $\sqrt{2}$ to obtain a much more general
result.
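For completeness, here is a short worked derivation of how Corollary 2.7 follows from Theorem 2.6. It is a sketch that takes $\mu_t$ to be the midpoint of the range of the payoffs at time $t$, so that every payoff lies within $R_t/2$ of $\mu_t$.

```latex
% The variance is at most the mean squared deviation from any real number \mu_t:
\mathrm{var}(X_t) \le \mathbb{E}\bigl[(X_t - \mu_t)^2\bigr]
  = \sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}} \bigl(h(f_{i,t}, y_t) - \mu_t\bigr)^2
  \le \frac{R_t^2}{4} .
% Summing over t and plugging into the main term of Theorem 2.6:
V_n \le \frac{1}{4} \sum_{t=1}^n R_t^2
\quad\Longrightarrow\quad
4\sqrt{V_n \ln N} \le 2\sqrt{(\ln N) \sum_{t=1}^n R_t^2} .
```

In a loss game with range $[-M, 0]$ one has $R_t \le M$, which gives the leading term $2M\sqrt{n \ln N}$ mentioned above.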


2.9 Simulatable Experts 


So far, we have viewed experts as unspecified entities generating, at each round $t$,
advice to which the forecaster has access. A different setup is when the experts themselves
are accessible to the forecaster, who can make arbitrary experiments to reveal their
future behavior. In this scenario we may define an expert $E$ using a sequence of functions
$f_{E,t} : \mathcal{Y}^{t-1} \to \mathcal{D}$, $t = 1, 2, \ldots$ such that, at every time instant $t$, the expert predicts
according to $f_{E,t}(y^{t-1})$. We also assume that the forecaster has access to these functions
and therefore can, at any moment, hypothesize future outcomes and compute all
experts’ future predictions for that specific sequence of outcomes. Thus, the forecaster
can “simulate” the experts’ future reactions, and we call such experts simulatable. For
example, a simulatable expert for a prediction problem where $\mathcal{D} = \mathcal{Y}$ is the expert $E$ such
that $f_{E,t}(y^{t-1}) = (y_1 + \cdots + y_{t-1})/(t - 1)$. Note that we have assumed that at time $t$ the
prediction of a simulatable expert only depends on the sequence $y^{t-1}$ of past observed outcomes.
This is not true for the more general type of experts, whose advice might depend on
arbitrary sources of information also hidden from the forecaster. More importantly, while
the prediction of a general expert, at time $t$, may depend on the past moves $\widehat{p}_1, \ldots, \widehat{p}_{t-1}$ of
the forecaster (just recall the protocol of the game of prediction with expert advice), the values
a simulatable expert outputs only depend on the past sequence of outcomes. Because
here we are not concerned with computational issues, we allow $f_{E,t}$ to be an arbitrary
function of $y^{t-1}$ and assume that the forecaster may always compute such a function.

A special type of simulatable expert is a static expert. An expert $E$ is static when its
predictions $f_{E,t}$ only depend on the round index $t$ and not on $y^{t-1}$. In other words, the
functions $f_{E,t}$ are all constant valued. Thus, a static expert $E$ is completely described
by the sequence $f_{E,1}, f_{E,2}, \ldots$ of its predictions at each round $t$. This sequence is fixed
irrespective of the actual observed outcomes. For this reason we use $f = (f_1, f_2, \ldots)$ to
denote an arbitrary static expert. Abusing notation, we use $f$ also to denote simulatable
experts.
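The distinction between simulatable and static experts can be made concrete with a few lines of code. In this sketch an expert is modeled as a function of the past outcomes $y^{t-1}$ only; the names `running_average_expert`, `make_static_expert`, and `simulate` are hypothetical, and the value 0.5 returned by the running-average expert at $t = 1$ (where the average is undefined) is an arbitrary assumption.

```python
from typing import Callable, List, Sequence

# A simulatable expert maps the past outcomes y^{t-1} to a prediction in D.
Expert = Callable[[Sequence[float]], float]


def running_average_expert(past: Sequence[float]) -> float:
    """Predict the mean of the past outcomes (0.5 at t = 1 is an assumed default)."""
    return sum(past) / len(past) if past else 0.5


def make_static_expert(predictions: Sequence[float]) -> Expert:
    """A static expert ignores the outcomes and only uses the round index."""
    def expert(past: Sequence[float]) -> float:
        return predictions[len(past)]  # len(past) = t - 1
    return expert


def simulate(expert: Expert, hypothetical: Sequence[float]) -> List[float]:
    """Predictions the expert would make on a hypothesized outcome sequence."""
    return [expert(hypothetical[:t]) for t in range(len(hypothetical))]
```

Because `simulate` only needs the expert and a hypothesized sequence, the forecaster can evaluate it on any continuation of the outcomes observed so far; for a static expert the result does not depend on the sequence at all.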

Simulatable and static experts provide the forecaster with additional power. It is then 
interesting to consider whether this additional power could be exploited to reduce the 
forecaster’s regret. This is investigated in depth for specific loss functions in Chapters 8 
and 9. 


2.10 Minimax Regret 


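In the model of prediction with expert advice, the best regret bound obtained so far, which
holds for all $[0, 1]$-valued convex losses, is $\sqrt{(n/2) \ln N}$. This is achieved (for any fixed
$n \ge 1$) by the exponentially weighted average forecaster. Is this the best possible uniform
bound? Which type of forecaster achieves the best regret bound for each specific loss? To
address these questions in a rigorous way we introduce the notion of minimax regret. Fix
a loss function $\ell$ and consider $N$ general experts. Define the minimax regret at horizon
$n$ by

\[
V_n^{(N)} = \sup_{(f_{1,1}, \ldots, f_{N,1}) \in \mathcal{D}^N}\ \inf_{\widehat{p}_1 \in \mathcal{D}}\ \sup_{y_1 \in \mathcal{Y}}\
\sup_{(f_{1,2}, \ldots, f_{N,2}) \in \mathcal{D}^N}\ \inf_{\widehat{p}_2 \in \mathcal{D}}\ \sup_{y_2 \in \mathcal{Y}}\ \cdots
\]
\[
\cdots \sup_{(f_{1,n}, \ldots, f_{N,n}) \in \mathcal{D}^N}\ \inf_{\widehat{p}_n \in \mathcal{D}}\ \sup_{y_n \in \mathcal{Y}}
\left( \sum_{t=1}^n \ell(\widehat{p}_t, y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^n \ell(f_{i,t}, y_t) \right).
\]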

An equivalent, but simpler, definition of minimax regret can be given using static experts.
Define a strategy for the forecaster as a prescription for computing, at each round $t$, the
prediction $\widehat{p}_t$ given the past $t - 1$ outcomes $y_1, \ldots, y_{t-1}$ and the expert advice $(f_{1,s}, \ldots, f_{N,s})$
for $s = 1, \ldots, t$. Formally, a forecasting strategy $P$ is a sequence $\widehat{p}_1, \widehat{p}_2, \ldots$ of
functions

\[
\widehat{p}_t : \mathcal{Y}^{t-1} \times (\mathcal{D}^N)^t \to \mathcal{D} .
\]

Now fix any class $\mathcal{F}$ of $N$ static experts and let $\widehat{L}_n(P, \mathcal{F}, y^n)$ be the cumulative loss on the
sequence $y^n$ of the forecasting strategy $P$ using the advice of the experts in $\mathcal{F}$. Then the
minimax regret $V_n^{(N)}$ can be equivalently defined as

\[
V_n^{(N)} = \inf_P\ \sup_{\{\mathcal{F} \,:\, |\mathcal{F}| = N\}}\ \sup_{y^n \in \mathcal{Y}^n}\ \max_{i=1,\ldots,N}
\left( \widehat{L}_n(P, \mathcal{F}, y^n) - \sum_{t=1}^n \ell(f_{i,t}, y_t) \right),
\]

where the infimum is over all forecasting strategies $P$ and the first supremum is over all
possible classes of $N$ static experts (see Exercise 2.18).

The minimax regret measures the best possible performance guarantee one can have
for a forecasting algorithm that holds for all possible classes of $N$ experts and all outcome
sequences of length $n$. An upper bound on $V_n^{(N)}$ establishes the existence of a forecasting
strategy achieving a regret not larger than the upper bound, regardless of what the class of
experts and the outcome sequence are. On the other hand, a lower bound on $V_n^{(N)}$ shows
that for any forecasting strategy there exists a class of $N$ experts and an outcome sequence
such that the regret of the forecaster is at least as large as the lower bound.

In this chapter and in the next, we derive minimax regret upper bounds for several losses,
including $V_n^{(N)} \le \ln N$ for the logarithmic loss $\ell(x, y) = -\mathbb{I}_{\{y=1\}} \ln x - \mathbb{I}_{\{y=0\}} \ln(1 - x)$,
where $x \in [0, 1]$ and $y \in \{0, 1\}$, and $V_n^{(N)} \le \sqrt{(n/2) \ln N}$ for all $[0, 1]$-valued convex losses,
both achieved by the exponentially weighted average forecaster. In Chapters 3, 8, and 9 we
complement these results by proving, among other related results, that $V_n^{(N)} = \ln N$ for the
logarithmic loss provided that $n \ge \log_2 N$ and that the minimax regret for the absolute loss
$\ell(x, y) = |x - y|$ is asymptotically $\sqrt{(n/2) \ln N}$, matching the upper bound we derived for
convex losses. This entails that the exponentially weighted average forecaster is minimax
optimal, in an asymptotic sense, for both the logarithmic and absolute losses.

The notion of minimax regret defined above is based on the performance of any forecaster
in the case of the worst possible class of experts. However, often one is interested in the best
possible performance a forecaster can achieve compared with the best expert in a fixed class.
This leads to the definition of minimax regret for a fixed class of (simulatable) experts as
follows. Fix some loss function $\ell$ and let $\mathcal{F}$ be a (not necessarily finite) class of simulatable
experts. A forecasting strategy $P$ based on $\mathcal{F}$ is now just a sequence $\widehat{p}_1, \widehat{p}_2, \ldots$ of functions
$\widehat{p}_t : \mathcal{Y}^{t-1} \to \mathcal{D}$. (Note that $\widehat{p}_t$ implicitly depends on $\mathcal{F}$, which is fixed. Therefore, as the
experts in $\mathcal{F}$ are simulatable, $\widehat{p}_t$ need not depend explicitly on the expert advice.) The
minimax regret with respect to $\mathcal{F}$ at horizon $n$ is then defined by

\[
V_n(\mathcal{F}) = \inf_P\ \sup_{y^n \in \mathcal{Y}^n}
\left( \sum_{t=1}^n \ell\bigl(\widehat{p}_t(y^{t-1}), y_t\bigr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell\bigl(f_t(y^{t-1}), y_t\bigr) \right).
\]

This notion of regret is studied for specific losses in Chapters 8 and 9.
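For very small instances the quantity $V_n(\mathcal{F})$ can be computed by backward induction over the game tree. The sketch below does this for one illustrative case chosen here, not taken from the text above: binary outcomes, the absolute loss, and the class of the two constant static experts (always 0 and always 1). As a simplifying assumption, the forecaster's predictions are restricted to a uniform grid on $[0, 1]$, so in general the function returns an upper bound on the true minimax regret; the function name is hypothetical.

```python
from functools import lru_cache


def minimax_regret_two_constant_experts(n, grid_size=65):
    """Minimax regret at horizon n against the experts "always 0" and "always 1".

    Binary outcomes, absolute loss; predictions restricted to a uniform grid.
    """
    grid = [j / (grid_size - 1) for j in range(grid_size)]

    @lru_cache(maxsize=None)
    def value(t, ones):
        # t outcomes have been revealed so far, `ones` of them equal to 1.
        if t == n:
            # "always 0" suffers loss `ones`; "always 1" suffers n - ones.
            return -min(ones, n - ones)
        return min(
            max(p + value(t + 1, ones),              # outcome 0: loss |p - 0|
                (1.0 - p) + value(t + 1, ones + 1))  # outcome 1: loss |p - 1|
            for p in grid
        )

    return value(0, 0)
```

For $n = 1, 2, 3$ the optimal predictions are multiples of $1/4$ and therefore lie on the default grid; a hand computation of the same recursion gives the values $1/2$, $1/2$, and $3/4$. The state only records how many ones have been seen because both experts are constant, which keeps the recursion polynomial in $n$.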

Given a class $\mathcal{F}$ of simulatable experts, one may also define the closely related quantity

\[
U_n(\mathcal{F}) = \sup_Q\ \inf_P \int_{\mathcal{Y}^n}
\left( \sum_{t=1}^n \ell\bigl(\widehat{p}_t(y^{t-1}), y_t\bigr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell\bigl(f_t(y^{t-1}), y_t\bigr) \right) dQ(y^n),
\]

where the supremum is taken over all probability measures $Q$ over the set $\mathcal{Y}^n$ of sequences of
outcomes of length $n$. $U_n(\mathcal{F})$ is called the maximin regret with respect to $\mathcal{F}$. Of course, to
define probability measures over $\mathcal{Y}^n$, the set $\mathcal{Y}$ of outcomes should satisfy certain regularity
properties. For simplicity, assume that $\mathcal{Y}$ is a compact subset of $\mathbb{R}^d$. This assumption is
satisfied for most examples that appear in this book and can be significantly weakened if
necessary. A general minimax theorem, proved in Chapter 7, implies that if the decision
space $\mathcal{D}$ is convex and the loss function $\ell$ is convex and continuous in its first argument,
then

\[
V_n(\mathcal{F}) = U_n(\mathcal{F}) .
\]

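This equality follows simply by the fact that the function

\[
F(P, Q) = \int_{\mathcal{Y}^n}
\left( \sum_{t=1}^n \ell\bigl(\widehat{p}_t(y^{t-1}), y_t\bigr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell\bigl(f_t(y^{t-1}), y_t\bigr) \right) dQ(y^n)
\]

is convex in its first argument and concave (actually linear) in the second. Here we
define a convex combination $\lambda P^{(1)} + (1 - \lambda) P^{(2)}$ of two forecasting strategies
$P^{(1)} = (\widehat{p}_1^{(1)}, \widehat{p}_2^{(1)}, \ldots)$ and $P^{(2)} = (\widehat{p}_1^{(2)}, \widehat{p}_2^{(2)}, \ldots)$ by a forecaster that predicts, at time $t$, according
to

\[
\lambda\, \widehat{p}_t^{(1)}(y^{t-1}) + (1 - \lambda)\, \widehat{p}_t^{(2)}(y^{t-1}) .
\]

We leave the details of checking the conditions of Theorem 7.1 to the reader (see
Exercise 2.19).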


2.11 Discounted Regret 


In several applications it is reasonable to assume that losses in the past are less significant
than recently suffered losses. Thus, one may consider discounted regrets of the form

\[
\rho_{i,n} = \sum_{t=1}^n \beta_{n-t}\, r_{i,t},
\]

where the discount factors $\beta_t$ are typically decreasing with $t$ and $r_{i,t} = \ell(\widehat{p}_t, y_t) - \ell(f_{i,t}, y_t)$
is the instantaneous regret with respect to expert $i$ at round $t$. In particular, we assume that
$\beta_0 \ge \beta_1 \ge \beta_2 \ge \cdots$ is a nonincreasing sequence and, without loss of generality, we let
$\beta_0 = 1$. Thus, at time $t = n$, the actual regret $r_{i,t}$ has full weight while regrets suffered in
the past have smaller weight; the more distant the past, the less its weight.

In this setup the goal of the forecaster is to ensure that, regardless of the sequence of
outcomes, the average discounted cumulative regret

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}}
\]

is as small as possible. More precisely, one would like to bound the average discounted
regret by a function of $n$ that converges to zero as $n \to \infty$. The purpose of this section is to
explore for what sequences of discount factors it is possible to achieve this goal. The case
when $\beta_t = 1$ for all $t$ corresponds to the case studied in the rest of this chapter. Other natural
choices include the exponential discount sequence $\beta_t = a^{-t}$ for some $a > 1$ or sequences
of the form $\beta_t = (t + 1)^{-a}$ with $a > 0$.
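The average discounted cumulative regret is straightforward to compute from a table of instantaneous regrets. The following minimal sketch assumes the regrets are given as a list of rows, one per expert, and that the discount sequence is passed as a function `beta` with `beta(0) == 1`; both conventions and the function name are choices made for this illustration.

```python
def average_discounted_regret(inst_regrets, beta):
    """Maximum over experts of the average discounted cumulative regret.

    inst_regrets[i][t - 1] holds r_{i,t}; beta(k) returns the discount factor
    beta_k, so round t receives the weight beta(n - t).
    """
    n = len(inst_regrets[0])
    weights = [beta(n - t) for t in range(1, n + 1)]
    norm = sum(weights)
    return max(
        sum(w * r for w, r in zip(weights, regrets)) / norm
        for regrets in inst_regrets
    )
```

With `beta = lambda k: 1.0` this reduces to the usual per-round regret, while `beta = lambda k: a ** (-k)` and `beta = lambda k: (k + 1) ** (-a)` give the two families of discount sequences mentioned above.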

First we observe that if the discount sequence decreases too quickly, then, except for 
trivial cases, there is no hope to prove any meaningful bound. 


Theorem 2.7. Assume that there is a positive constant $c$ such that for each $n$ there
exist outcomes $y_1, y_2 \in \mathcal{Y}$ and two experts $i \ne i'$ such that $i = \mathrm{argmin}_j\, \ell(f_{j,n}, y_1)$,
$i' = \mathrm{argmin}_j\, \ell(f_{j,n}, y_2)$, and $\min_{y = y_1, y_2} |\ell(f_{i,n}, y) - \ell(f_{i',n}, y)| \ge c$. If $\sum_{t=0}^\infty \beta_t < \infty$,
then there exists a constant $C$ such that, for any forecasting strategy, there is a sequence of
outcomes such that

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}} \ge C
\]

for all $n$.


Proof. The lower bound follows simply by observing that the weight of the regrets at the
last time step $t = n$ is too large: it is comparable with the total weight of the whole past.
Formally,

\[
\max_{i=1,\ldots,N} \sum_{t=1}^n \beta_{n-t}\, r_{i,t} \ge \max_{i=1,\ldots,N} \beta_0\, r_{i,n}
= \ell(\widehat{p}_n, y_n) - \min_{i=1,\ldots,N} \ell(f_{i,n}, y_n) .
\]

Thus,

\[
\sup_{y^n \in \mathcal{Y}^n}\ \max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}}
\ge \frac{\sup_{y \in \mathcal{Y}} \bigl(\ell(\widehat{p}_n, y) - \min_{i=1,\ldots,N} \ell(f_{i,n}, y)\bigr)}{\sum_{t=0}^\infty \beta_t}
\ge \frac{c}{2 \sum_{t=0}^\infty \beta_t} .
\]
□

Next we contrast the result by showing that whenever the discount factors decrease
sufficiently slowly such that $\sum_{t=0}^\infty \beta_t = \infty$, it is possible to make the average discounted regret
vanish for large $n$. This follows from an easy application of Theorem 2.1. We may define
weighted average strategies on the basis of the discounted regrets simply by replacing $r_{i,t}$
by $\widetilde{r}_{i,t} = \beta_{n-t}\, r_{i,t}$ in the definition of the weighted average forecaster. Of course, to use such
a predictor, one needs to know the time horizon $n$ in advance. We obtain the following.


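Theorem 2.8. Consider a discounted polynomially weighted average forecaster defined,
for $t = 1, \ldots, n$, by

\[
\widehat{p}_t = \frac{\sum_{i=1}^N \phi'\bigl(\sum_{s=1}^{t-1} \widetilde{r}_{i,s}\bigr) f_{i,t}}{\sum_{j=1}^N \phi'\bigl(\sum_{s=1}^{t-1} \widetilde{r}_{j,s}\bigr)}
= \frac{\sum_{i=1}^N \phi'\bigl(\sum_{s=1}^{t-1} \beta_{n-s}\, r_{i,s}\bigr) f_{i,t}}{\sum_{j=1}^N \phi'\bigl(\sum_{s=1}^{t-1} \beta_{n-s}\, r_{j,s}\bigr)} ,
\]

where $\phi'(x) = (p - 1)x^p$, with $p = 2 \ln N$. Then the average discounted regret satisfies

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}}
\le \sqrt{2e \ln N}\ \frac{\sqrt{\sum_{t=1}^n \beta_{n-t}^2}}{\sum_{t=1}^n \beta_{n-t}} .
\]

(A similar bound may be proven for the discounted exponentially weighted average
forecaster as well.) In particular, if $\sum_{t=0}^\infty \beta_t = \infty$, then

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}} = o(1) .
\]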

Proof. Clearly the forecaster satisfies the Blackwell condition, and for each $t$, $|\widetilde{r}_{i,t}| \le \beta_{n-t}$.
Then Theorem 2.1 implies, just as in the proof of Corollary 2.1,

\[
\max_{i=1,\ldots,N} \rho_{i,n} \le \sqrt{2e \ln N}\ \sqrt{\sum_{t=1}^n \beta_{n-t}^2}
\]

and the first statement follows. To prove the second, just note that, because $\beta_t \le \beta_0 = 1$ for all $t$,

\[
\sqrt{2e \ln N}\ \frac{\sqrt{\sum_{t=1}^n \beta_{n-t}^2}}{\sum_{t=1}^n \beta_{n-t}}
\le \sqrt{2e \ln N}\ \frac{\sqrt{\sum_{t=1}^n \beta_{n-t}}}{\sum_{t=1}^n \beta_{n-t}}
= \frac{\sqrt{2e \ln N}}{\sqrt{\sum_{t=1}^n \beta_{n-t}}} = o(1) .
\]
□

It is instructive to consider the special case when $\beta_t = (t + 1)^{-a}$ for some $0 < a \le 1$.
(Recall from Theorem 2.7 that for $a > 1$, no meaningful bound can be derived.) If $a = 1$,
Theorem 2.8 implies that

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}}
\le \sqrt{2e \ln N}\ \frac{\sqrt{\sum_{t=1}^n t^{-2}}}{\sum_{t=1}^n t^{-1}}
\le \frac{C}{\log n}
\]

for a constant $C > 0$. This slow rate of convergence to zero is not surprising in view of
Theorem 2.7, because the series $\sum_t 1/(t + 1)$ is “barely nonsummable.” In fact, this bound
cannot be improved substantially (see Exercise 2.20). However, for $a < 1$ the convergence
is faster. In fact, an easy calculation shows that the upper bound of the theorem implies
that

\[
\max_{i=1,\ldots,N} \frac{\sum_{t=1}^n \beta_{n-t}\, r_{i,t}}{\sum_{t=1}^n \beta_{n-t}} =
\begin{cases}
O(1/\log n) & \text{if } a = 1 \\
O(n^{a-1}) & \text{if } 1/2 < a < 1 \\
O\bigl(\sqrt{(\log n)/n}\bigr) & \text{if } a = 1/2 \\
O(1/\sqrt{n}) & \text{if } a < 1/2 .
\end{cases}
\]

Not surprisingly, the slower the discount factor decreases, the faster the average discounted
cumulative regret converges to zero. However, it is interesting to observe the “phase
transition” occurring at $a = 1/2$: for all $a < 1/2$, the average regret decreases at a rate $n^{-1/2}$,
a behavior quantitatively similar to the case when no discounting is taken into account.
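The different regimes can also be observed numerically by evaluating the right-hand side of Theorem 2.8 for the polynomial discount sequence. The sketch below is only an illustration of that formula; the function name is hypothetical, and it uses the fact that summing $\beta_{n-t}$ over $t = 1, \ldots, n$ is the same as summing $\beta_0, \ldots, \beta_{n-1}$.

```python
import math


def theorem_2_8_bound(n, n_experts, a):
    """Upper bound of Theorem 2.8 for the discounts beta_t = (t + 1) ** (-a)."""
    betas = [(t + 1) ** (-a) for t in range(n)]  # beta_0, ..., beta_{n-1}
    sum_sq = sum(b * b for b in betas)
    return math.sqrt(2.0 * math.e * math.log(n_experts)) * math.sqrt(sum_sq) / sum(betas)
```

For $a = 0$ (no discounting) the bound equals $\sqrt{2e \ln N}/\sqrt{n}$ exactly. For $a < 1/2$, quadrupling $n$ roughly halves the bound, in agreement with the rate $n^{-1/2}$, whereas for $a = 1$ the decrease is only logarithmic.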


2.12 Bibliographic Remarks 


Our model of sequential prediction with expert advice finds its roots in the theory of
repeated games. Zero-sum repeated games with fixed loss matrix are a classical topic
of game theory. In these games, the regret after $n$ plays is defined as the excess loss of
the row player with respect to the smallest loss that could be incurred had he known in
advance the empirical distribution of the column player actions during the $n$ plays. In his
pioneering work, Hannan [141] devises a randomized playing strategy whose per-round
expected regret grows at rate $N\sqrt{3nm/2}$, where $N$ is the number of rows, $m$ is the number
of columns in the loss matrix, and $n$ is the number of plays. As shown in Chapter 7, our
polynomially and exponentially weighted average forecasters can be used to play
zero-sum repeated games achieving the regret $\sqrt{(n/2) \ln N}$. We obtain the same dependence on
the number $n$ of plays as in Hannan’s regret, but we significantly improve the
dependence on the dimensions, $N$ and $m$, of the loss matrix. A different randomized player
with a vanishing per-round regret can also be derived from Blackwell’s celebrated
approachability theorem [28], generalizing von Neumann’s minimax theorem to
vector-valued payoffs. This result, which we re-derive in Chapter 7, is based on a mixed strategy
equivalent to our polynomially weighted average forecaster with $p = 2$. Exact asymptotic
constants for the minimax regret (for a special case) were first shown by Cover [68]. In our
terminology, Cover investigates the problem of predicting a sequence of binary outcomes
with two static experts, one always predicting 0 and the other always predicting 1. He shows
that the minimax regret for the absolute loss in this special case is $(1 + o(1))\sqrt{n/(2\pi)}$.

The problem of sequential prediction, deprived of any probabilistic assumption, is
deeply connected with the information-theoretic problem of compressing an individual
data sequence. Pioneering research in this field was carried out by Ziv [317,318] and
Lempel and Ziv [197,317,319], who solved the problem of compressing an individual data
sequence almost as well as the best finite-state automaton. As shown by Feder, Merhav, and
Gutman [95], the Lempel-Ziv compressor can be used as a randomized forecaster (for the
absolute loss) with a vanishing per-round regret against the class of all finite-state experts, a
surprising result considering the rich structure of this class. In addition, Feder, Merhav, and
Gutman devise, for the same expert class, a forecaster with a convergence rate better than
the rate provable for the Lempel-Ziv forecaster (see also Merhav and Feder [213] for
further results along these lines). In Chapter 9 we continue the investigation of the relationship
between prediction and compression showing simple conditions under which prediction
with logarithmic loss is minimax equivalent to adaptive data compression. Connections
between prediction with expert advice and information content of an individual sequence
have been explored by Vovk and Watkins [303], who introduced the notion of predictive
complexity of a data sequence, a quantity that, for the logarithmic loss, is related to the
Kolmogorov complexity of the sequence. We refer to the book of Li and Vitányi [198] for
an excellent introduction to the algorithmic theory of randomness.

Approximately at the same time when Hannan and Blackwell were laying down the
foundations of the game-theoretic approach to prediction, Solomonoff had the idea of
formalizing the phenomenon of inductive inference in humans as a sequential prediction
process. This research eventually led him to the introduction of a universal prior
probability [273-275], to be used as a prior in Bayesian inference. An important “side product” of
Solomonoff’s universal prior is the notion of algorithmic randomness, which he introduced
independently of Kolmogorov. Though we acknowledge the key role played by Solomonoff
in the field of sequential prediction theory, especially in connection with Kolmogorov
complexity, in this book we look at the problem of forecasting from a different angle. Having
said this, we certainly think that exploring the connections between algorithmic randomness
and game theory, through the unifying notion of prediction, is an exciting research
plan.

The field of inductive inference investigates the problem of sequential prediction when
experts are functions taken from a large class, possibly including all recursive languages or
all partial recursive functions, and the task is that of eventually identifying an expert that
is consistent (or nearly consistent) with an infinite sequence of observations. This learning
paradigm, introduced in 1967 by Gold [130], is still actively studied. Unlike the theory
described in this book, whose roots are game theoretic, the main ideas and analytical tools
used in inductive inference come from recursion theory (see Odifreddi [227]).

In computer science, an area related to prediction with experts is competitive analysis of 
online algorithms (see the monograph of Borodin and El-Yaniv [36] for a survey). A good 
example of a paper exploring the use of forecasting algorithms in competitive analysis is 
the work by Blum and Burch [32]. 

The paradigm of prediction with expert advice was introduced by Littlestone and
Warmuth [203] and Vovk [297], and further developed by Cesa-Bianchi, Freund,
Haussler, Helmbold, Schapire, and Warmuth [48] and Vovk [298], although some of its main
ingredients already appear in the papers of De Santis, Markowski, and Wegman [260],
Littlestone [200], and Foster [103]. The use of potential functions in sequential prediction
is due to Hart and Mas-Colell [146], who used Blackwell’s condition in a game-theoretic
context, and to Grove, Littlestone, and Schuurmans [133], who used exactly the same
condition for the analysis of certain variants of the Perceptron algorithm (see Chapter 11).
Our Theorem 2.1 is inspired by, and partially builds on, Hart and Mas-Colell’s analysis of
$\Lambda$-strategies for playing repeated games [146] and on the analysis of the quasi-additive
algorithm of Grove, Littlestone, and Schuurmans [133]. The unified framework for sequential
prediction based on potential functions that we describe here was introduced by
Cesa-Bianchi and Lugosi [54]. Forecasting based on the exponential potential has been used in
game theory as a variant of smooth fictitious play (see, e.g., the book of Fudenberg and
Levine [119]). In learning theory, exponentially weighted average forecasters were
introduced and analyzed by Littlestone and Warmuth [203] (the weighted majority algorithm)
and by Vovk [297] (the aggregating algorithm). The trick of setting the parameter $p$ of
the polynomial potential to $2 \ln N$ is due to Gentile [123]. The analysis in Section 2.2 is
based on Cesa-Bianchi’s work [46]. The idea of the doubling trick of Section 2.3 appears in
the articles of Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth [48] and
Vovk [298], whereas the analysis of Theorem 2.3 is adapted from Auer, Cesa-Bianchi,
and Gentile [13]. The data-dependent bounds of Section 2.4 are based on two sources:
Theorem 2.4 is from the work of Littlestone and Warmuth [203] and Corollary 2.4 is due
to Freund and Schapire [112]. A more sophisticated analysis of the exponentially weighted
average forecaster with time-varying $\eta_t$ is due to Yaroshinski, El-Yaniv, and Seiden [315].
They show a regret bound of the order $(1 + o(1))\sqrt{2 L^* \ln N}$, where $o(1) \to 0$ for $L^* \to \infty$.
Hutter and Poland [165] prove a result similar to Exercise 2.10 using follow-the-perturbed-leader,
a randomized forecaster that we analyze in Chapter 4.

The multilinear forecaster and the results of Section 2.8 are due to Cesa-Bianchi, Man- 
sour, and Stoltz [57]. A weaker version of Corollary 2.7 was proven by Allenberg-Neeman 
and Neeman [7]. 

The gradient-based forecaster of Section 2.5 was introduced by Kivinen and
Warmuth [181]. The proof of Corollary 2.5 is due to Cesa-Bianchi [46]. The notion of
simulatable experts and worst-case regret for the experts’ framework was first investigated by
Cesa-Bianchi et al. [48]. Results for more general loss functions are contained in Chung’s
paper [60]. Fudenberg and Levine [121] consider discounted regrets in a somewhat different
model than the one discussed here.

The model of prediction with expert advice is connected to Bayesian decision theory. For
instance, when the absolute loss is used, the normalized weights of the weighted average
forecaster based on the exponential potential closely approximate the posterior distribution
of a simple stochastic generative model for the data sequence (see Exercise 2.7). From this
viewpoint, our regret analysis shows an example where the Bayes decisions are robust in
a strong sense, because their performance can be bounded not only in expectation with
respect to the random draw of the sequence but also for each individual sequence.


2.13 Exercises


2.1 Assume that you have to predict a sequence $Y_1, Y_2, \ldots \in \{0, 1\}$ of i.i.d. random variables with
unknown distribution, your decision space is $[0, 1]$, and the loss function is $\ell(\widehat{p}, y) = |\widehat{p} - y|$.
How would you proceed? Try to estimate the cumulative loss of your forecaster and compare
it to the cumulative loss of the best of the two experts, one of which always predicts 1 and
the other always predicts 0. Which are the most “difficult” distributions? How does your
(expected) regret compare to that of the weighted average algorithm (which does not “know”
that the outcome sequence is i.i.d.)?


2.2 Consider a weighted average forecaster based on a potential function

\[
\Phi(\mathbf{u}) = \psi\left( \sum_{i=1}^N \phi(u_i) \right).
\]

Assume further that the quantity $C(\mathbf{r}_t)$ appearing in the statement of Theorem 2.1 is bounded
by a constant for all values of $\mathbf{r}_t$ and that the function $\psi(\phi(u))$ is strictly convex. Show that
there exists a nonnegative sequence $\varepsilon_n \to 0$ such that the cumulative regret of the forecaster
satisfies, for every $n$ and for every outcome sequence $y^n$,

\[
\frac{1}{n} \left( \max_{i=1,\ldots,N} R_{i,n} \right) \le \varepsilon_n .
\]

2.3 Analyze the polynomially weighted average forecaster using Theorem 2.1 but using the
potential function $\Phi(\mathbf{u}) = \|\mathbf{u}_+\|_p$ instead of the choice $\Phi_p(\mathbf{u}) = \|\mathbf{u}_+\|_p^2$ used in the proof of
Corollary 2.1. Derive a bound of the same form as in Corollary 2.1, perhaps with different constants.

2.4 Let $\mathcal{Y} = \{0, 1\}$, $\mathcal{D} = [0, 1]$, and $\ell(\widehat{p}, y) = |\widehat{p} - y|$. Prove that the cumulative loss $\widehat{L}$ of the
exponentially weighted average forecaster is always at least as large as the cumulative loss
$\min_{i \le N} L_i$ of the best expert. Show that for other loss functions, such as the square loss
$(\widehat{p} - y)^2$, this is not necessarily so. Hint: Try to reverse the proof of Theorem 2.2.
2.5 (Nonuniform initial weights) By definition, the weighted average forecaster uses uniform
initial weights $w_{i,0} = 1$ for all $i = 1, \ldots, N$. However, there is nothing special about this
choice, and the analysis of the regret for this forecaster can be carried out using any set of
nonnegative numbers for the initial weights.

Consider the exponentially weighted average forecaster run with arbitrary initial weights
$w_{1,0}, \ldots, w_{N,0} > 0$, defined, for all $t = 1, 2, \ldots$, by

\[
\widehat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{\sum_{j=1}^N w_{j,t-1}},
\qquad w_{i,t} = w_{i,t-1}\, e^{-\eta\, \ell(f_{i,t}, y_t)} .
\]

Under the same conditions as in the statement of Theorem 2.2, show that for every $n$ and for
every outcome sequence $y^n$,

\[
\widehat{L}_n \le \min_{i=1,\ldots,N} \left( L_{i,n} + \frac{1}{\eta} \ln \frac{1}{w_{i,0}} \right) + \frac{\ln W_0}{\eta} + \frac{\eta}{8}\, n ,
\]

where $W_0 = w_{1,0} + \cdots + w_{N,0}$.

2.6 (Many good experts) Sequences of outcomes on which many experts suffer a small loss are
intuitively easier to predict. Adapt the proof of Theorem 2.2 to show that the exponentially
weighted forecaster satisfies the following property: for every $n$, for every outcome sequence
$y^n$, and for all $L > 0$,

\[
\widehat{L}_n \le L + \frac{1}{\eta} \ln \frac{N}{N_L} + \frac{\eta}{8}\, n ,
\]

where $N_L$ is the cardinality of the set $\{1 \le i \le N : L_{i,n} \le L\}$.

2.7 (Random generation of the outcome sequence) Consider the exponentially weighted average
forecaster and define the following probabilistic model for the generation of the sequence
$y^n \in \{0, 1\}^n$, where we now view each bit $y_t$ as the realization of a Bernoulli random variable
$Y_t$. An expert $I$ is drawn at random from the set of $N$ experts. For each $t = 1, \ldots, n$, first
$X_t \in \{0, 1\}$ is drawn so that $X_t = 1$ with probability $f_{I,t}$. Then $Y_t$ is set to $X_t$ with probability $\beta$
and is set to $1 - X_t$ with probability $1 - \beta$, where $\beta = 1/(1 + e^{-\eta})$. Show that the forecaster
weights $w_{i,t-1}/(w_{1,t-1} + \cdots + w_{N,t-1})$ are equal to the posterior probability
$\mathbb{P}[I = i \mid Y_1 = y_1, \ldots, Y_{t-1} = y_{t-1}]$ that expert $i$ is drawn given that the sequence $y_1, \ldots, y_{t-1}$ has been
observed.

2.8 (The doubling trick) Consider the following forecasting strategy (“doubling trick”): time is
divided in periods $(2^m, \ldots, 2^{m+1} - 1)$, where $m = 0, 1, 2, \ldots$. In period $(2^m, \ldots, 2^{m+1} - 1)$
the strategy uses the exponentially weighted average forecaster initialized at time $2^m$ with
parameter $\eta_m = \sqrt{8 (\ln N)/2^m}$. Thus, the weighted average forecaster is reset at each time
instance that is an integer power of 2 and is restarted with a new value of $\eta$. Using Theorem 2.2
prove that, for any sequence $y_1, y_2, \ldots \in \mathcal{Y}$ of outcomes and for any $n \ge 1$, the regret of this
forecaster is at most

\[
\frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{\frac{n}{2} \ln N} .
\]

2.9 (The doubling trick, continued) In Exercise 2.8, quite arbitrarily, we divided time into periods
of length $2^m$, $m = 1, 2, \ldots$. Investigate what happens if instead the period lengths are of the
form $\lfloor a^m \rfloor$ for some other value of $a > 0$. Which choice of $a$ minimizes, asymptotically, the
constant in the bound? How much can you gain compared with the bound given in the text?

2.10 Combine Theorem 2.4 with the doubling trick of Exercise 2.8 to construct a forecaster that,
without any previous knowledge of $L^*$, achieves, for all $n$,

\[
\widehat{L}_n - L_n^* \le 2\sqrt{2 L_n^* \ln N} + c \ln N
\]

whenever the loss function is bounded and convex in its first argument, and where $c$ is a
positive constant.


(Another time-varying potential) Consider the adaptive exponentially weighted average 
forecaster that, at time t, uses 


where $c$ is a positive constant. Show that whenever $\ell$ is a [0, 1]-valued loss function convex in 
its first argument, then there exists a choice of $c$ such that 

\[ \hat L_n - L_n^* \le 2\sqrt{2 L_n^* \ln N} + \kappa \ln N , \]

where $\kappa > 0$ is an appropriate constant (Auer, Cesa-Bianchi, and Gentile [13]). Hint: Follow 
the outline of the proof of Theorem 2.3. This exercise is not easy. 


Consider the prediction problem with $\mathcal{Y} = \mathcal{D} = [0, 1]$ with the absolute loss $\ell(\hat p, y) = |\hat p - y|$. 
Show that in this case the gradient-based exponentially weighted average forecaster coincides 
with the exponentially weighted average forecaster. (Note that the derivative of the loss does 
not exist for $\hat p = y$ and the definition of the gradient-based exponentially weighted average 
forecaster needs to be adjusted appropriately.) 

Prove Lemma 2.4. 

Use the doubling trick to prove a variant of Corollary 2.6 in which no knowledge about 
the outcome sequence is assumed to be preliminarily available (however, as in the corollary, 
we still assume that the payoff function h has range [-M, M] and is concave in its first 
argument). Express the regret bound in terms of the smallest monotone upper bound on the 
sequence $Q_1^*, Q_2^*, \dots$ (see Cesa-Bianchi, Mansour, and Stoltz [57]). 

Prove a variant of Theorem 2.6 in which no knowledge about the range $[-M, M]$ of the payoff 
function is assumed to be preliminarily available (see Cesa-Bianchi, Mansour, and Stoltz [57]). 
Hint: Replace the term $1/(2M)$ in the definition of $\eta_t$ with $2^{-(1+k)}$, where $k$ is the smallest 
nonnegative integer such that $\max_{s=1,\dots,t-1} \max_{i=1,\dots,N} |h(f_{i,s}, y_s)| \le 2^k$. 

Prove a regret bound for the multilinear forecaster using the update $w_{i,t} = w_{i,t-1}(1 + \eta r_{i,t})$, 
where $r_{i,t} = h(f_{i,t}, y_t) - h(\hat p_t, y_t)$ is the instantaneous regret. What can you say about the 
evolution of the total weight $W_t = w_{1,t} + \cdots + w_{N,t}$ of the experts? 

Prove Lemma 2.5. Hint: Adapt the proof of Theorem 2.3. 

Show that the two expressions of the minimax regret $V_n^{(N)}$ in Section 2.10 are equivalent. 


Consider a class $\mathcal{F}$ of simulatable experts. Assume that the set $\mathcal{Y}$ of outcomes is a compact 
subset of $\mathbb{R}^d$, the decision space $\mathcal{D}$ is convex, and the loss function $\ell$ is convex and continuous 
in its first argument. Show that $V_n(\mathcal{F}) = U_n(\mathcal{F})$. Hint: Check the conditions of Theorem 7.1. 

Consider the discount factors $\beta_t = 1/(t + 1)$ and assume that there is a positive constant $c$ 
such that for each $n$ there exist outcomes $y_1, y_2 \in \mathcal{Y}$ and two experts $i \neq i'$ such that $i = 
\operatorname{argmin}_j \ell(f_{j,n}, y_1)$, $i' = \operatorname{argmin}_j \ell(f_{j,n}, y_2)$, and $\min_{y = y_1, y_2} |\ell(f_{i,n}, y) - \ell(f_{i',n}, y)| \ge c$. Show 
that there exists a constant $C$ such that for any forecasting strategy, there is a sequence of 
outcomes such that 
ELON Xai logn 


for all n. 


Tight Bounds for Specific Losses 


3.1 Introduction 


In Chapter 2 we established the existence of forecasters that, under general circumstances, 
achieve a worst-case regret of the order of $\sqrt{n \ln N}$ with respect to any finite class of $N$ 
experts. The only condition we required was that the decision space $\mathcal{D}$ be a convex set 
and the loss function $\ell$ be bounded and convex in its first argument. In many cases, under 
specific assumptions on the loss function and/or the class of experts, significantly tighter 
performance bounds may be achieved. The purpose of this chapter is to review various 
situations in which such improvements can be made. We also explore the limits of such 
possible improvements by exhibiting lower bounds for the worst-case regret. 

In Section 3.2 we show that in some cases a myopic strategy that simply chooses 
the best expert on the basis of past performance achieves a rapidly vanishing worst- 
case regret under certain smoothness assumptions on the loss function and the class 
of experts. 

Our main technical tool, Theorem 2.1, has been used to bound the potential $\Phi(R_n)$ 
of the weighted average forecaster in terms of the initial potential $\Phi(0)$ plus a sum of 
terms bounding the error committed in taking linear approximations of each $\Phi(R_t)$ for 
$t = 1, \dots, n$. In certain cases, however, we can bound $\Phi(R_n)$ directly by $\Phi(0)$, without 
the need of taking any linear approximation. To do this we can exploit simple geometrical 
properties exhibited by the potential when combined with specific loss functions. In this 
chapter we develop several techniques of this kind and use them to derive tighter regret 
bounds for various loss functions. 

In Section 3.3 we consider a basic property of a loss function, exp-concavity, which 
ensures a bound of $(\ln N)/\eta$ for the exponentially weighted average forecaster, where 
$\eta$ must be smaller than a critical value depending on the specific exp-concave loss. In 
Section 3.4 we take a more extreme approach by considering a forecaster that chooses 
predictions minimizing the worst-case increase of the potential. This “greedy” forecaster is 
shown to perform not worse than the weighted average forecaster. In Section 3.5 we focus 
on the exponential potential and refine the analysis of the previous sections by introducing 
the aggregating forecasters, a family that includes the greedy forecaster. These forecasters, 
which apply to exp-concave losses, are designed to achieve a regret of the form $c \ln N$, 
where $c$ is the best possible constant for each loss for which such bounds are possible. In 
the course of the analysis we characterize the important subclass of mixable losses. These 
losses are, in some sense, the easiest to work with. Finally, in Section 3.7 we prove a lower 
bound of the form $\Omega(\log N)$ for generic losses and a lower bound for the absolute loss 
that matches (constants included) the upper bound achieved by the exponentially weighted 
average forecaster. 


3.2 Follow the Best Expert 


In this section we study possibly the simplest forecasting strategy. This strategy chooses, at 
time $t$, an expert that minimizes the cumulative loss over the past $t - 1$ time instances. In 
other words, the forecaster always follows the expert that has had the smallest cumulative 
loss up to that time. Perhaps surprisingly, this simple predictor has a good performance 
under general conditions on the loss function and the experts. 

Formally, consider a class $\mathcal{E}$ of experts and define the forecaster that predicts the same 
as the expert that minimizes the cumulative loss in the past, that is, 

\[ \hat p_t = f_{E,t} \quad \text{if} \quad E = \operatorname*{argmin}_{E' \in \mathcal{E}} \sum_{s=1}^{t-1} \ell(f_{E',s}, y_s) . \]

$\hat p_1$ is defined as an arbitrary element of $\mathcal{D}$. Throughout this section we assume that the 
minimum is always achieved. In case of multiple minimizers, $E$ can be chosen arbitrarily. 
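
As an illustration of the definition above, the following is a minimal sketch of the follow-the-best-expert rule for a finite class of experts. The function name `follow_the_best_expert`, the square loss, the five constant experts, and the tie-breaking rule (smallest index) are assumptions made for this example only; the text allows any loss, any class in which the minimum is achieved, and any tie-breaking rule.

```python
import random


def square_loss(p, y):
    return (p - y) ** 2


def follow_the_best_expert(expert_preds, outcomes, loss, first_prediction):
    """Predict as the expert with the smallest cumulative loss so far.

    expert_preds[t][i] is the advice of expert i at round t (0-based).
    """
    num_experts = len(expert_preds[0])
    cum_loss = [0.0] * num_experts
    predictions = []
    for t, y in enumerate(outcomes):
        if t == 0:
            # No past losses yet: any element of the decision space is allowed.
            predictions.append(first_prediction)
        else:
            # Ties are broken in favour of the smallest index (an assumption).
            leader = min(range(num_experts), key=lambda i: cum_loss[i])
            predictions.append(expert_preds[t][leader])
        # Only after predicting do we observe y and update the experts' losses.
        for i in range(num_experts):
            cum_loss[i] += loss(expert_preds[t][i], y)
    return predictions, cum_loss


random.seed(0)
n = 200
constant_advice = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical constant experts
expert_preds = [constant_advice for _ in range(n)]
outcomes = [random.random() for _ in range(n)]

predictions, cum_loss = follow_the_best_expert(
    expert_preds, outcomes, square_loss, first_prediction=0.5
)
forecaster_loss = sum(square_loss(p, y) for p, y in zip(predictions, outcomes))
regret = forecaster_loss - min(cum_loss)
print(f"regret against the best of {len(constant_advice)} experts: {regret:.4f}")
```

The forecaster only ever reads losses from rounds before the current one, which is what distinguishes it from the hypothetical forecaster used in the analysis.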

Our goal is, as before, to compare the performance of $\hat p$ with that of the best expert in 
the class; that is, to derive bounds for the regret 

\[ \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} . \]

Consider the hypothetical forecaster defined by 

\[ p_t^* = f_{E,t} \quad \text{if} \quad E = \operatorname*{argmin}_{E' \in \mathcal{E}} \sum_{s=1}^{t} \ell(f_{E',s}, y_s) . \]

Note that $p_t^*$ is defined like $\hat p_t$, with the only difference that $p_t^*$ also takes the losses suffered 
at time $t$ into account. Clearly, $p_t^*$ is not a “legal” forecaster, because it is allowed to peek 
into the future; it is defined as a tool for our analysis. The following simple lemma states 
that $p_t^*$ “performs” at least as well as the best expert. 


Lemma 3.1. For any sequence $y_1, \dots, y_n$ of outcomes, 

\[ \sum_{t=1}^{n} \ell(p_t^*, y_t) \le \sum_{t=1}^{n} \ell(p_n^*, y_t) = \min_{E \in \mathcal{E}} L_{E,n} . \]


Proof. The proof goes by induction. The statement is obvious for $n = 1$. Assume now 
that 

\[ \sum_{t=1}^{n-1} \ell(p_t^*, y_t) \le \sum_{t=1}^{n-1} \ell(p_{n-1}^*, y_t) . \]

Since by definition $\sum_{t=1}^{n-1} \ell(p_{n-1}^*, y_t) \le \sum_{t=1}^{n-1} \ell(p_n^*, y_t)$, the inductive assumption implies 

\[ \sum_{t=1}^{n-1} \ell(p_t^*, y_t) \le \sum_{t=1}^{n-1} \ell(p_n^*, y_t) . \]

Add $\ell(p_n^*, y_n)$ to both sides to obtain the result. □ 


This simple property implies that the regret of the forecaster $\hat p$ may be upper bounded 
as 

\[ \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} \le \sum_{t=1}^{n} \bigl( \ell(\hat p_t, y_t) - \ell(p_t^*, y_t) \bigr) . \]

By recalling the definitions of $\hat p_t$ and $p_t^*$, it is reasonable to expect that in some situations 
$\hat p_t$ and $p_t^*$ are close to each other. For example, if one can guarantee that for every $t$, 

\[ \sup_{y \in \mathcal{Y}} \bigl( \ell(\hat p_t, y) - \ell(p_t^*, y) \bigr) \le \varepsilon_t \]

for a sequence of real numbers $\varepsilon_t > 0$, then the inequality shows that 

\[ \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} \le \sum_{t=1}^{n} \varepsilon_t . \qquad (3.1) \]

In what follows, we establish some general conditions under which $\varepsilon_t \sim 1/t$, which implies 
that the regret grows as slowly as $O(\ln n)$. 

In all these examples we consider “constant” experts; that is, we assume that for each 
$E \in \mathcal{E}$ and $y \in \mathcal{Y}$, the loss of expert $E$ is independent of time. In other words, for any fixed 
$y$, $\ell(f_{E,1}, y) = \cdots = \ell(f_{E,n}, y)$. To simplify notation, we write $\ell(E, y)$ for the common 
value, where now each expert is characterized by an element $E \in \mathcal{D}$. 


Square Loss 
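As a first example consider the square loss in a general vector space. Assume that $\mathcal{D} = \mathcal{Y}$ 
is the unit ball $\{p : \|p\| \le 1\}$ in a Hilbert space $\mathcal{H}$, and consider the loss function 

\[ \ell(p, y) = \|p - y\|^2 , \qquad p, y \in \mathcal{H} . \]

Let the expert class $\mathcal{E}$ contain all constant experts indexed by $E \in \mathcal{D}$. In this case it is easy 
to determine explicitly the forecaster $\hat p$. Indeed, because for any $p \in \mathcal{D}$, 

\[ \frac{1}{t-1} \sum_{s=1}^{t-1} \|p - y_s\|^2 = \frac{1}{t-1} \sum_{s=1}^{t-1} \Bigl\| \frac{1}{t-1} \sum_{r=1}^{t-1} y_r - y_s \Bigr\|^2 + \Bigl\| \frac{1}{t-1} \sum_{r=1}^{t-1} y_r - p \Bigr\|^2 \]

(easily checked by expanding the squares), we have 

\[ \hat p_t = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s . \]

Similarly, 

\[ p_t^* = \frac{1}{t} \sum_{s=1}^{t} y_s . \]

Then, for any $y \in \mathcal{D}$, 

\[ \ell(\hat p_t, y) - \ell(p_t^*, y) = \|\hat p_t - y\|^2 - \|p_t^* - y\|^2 = (\hat p_t - p_t^*) \cdot (\hat p_t + p_t^* - 2y) \le 4 \|\hat p_t - p_t^*\| \]

by boundedness of the set $\mathcal{D}$. The norm of the difference can be easily bounded by writing 

\[ \hat p_t - p_t^* = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s - \frac{1}{t} \sum_{s=1}^{t} y_s = \Bigl( \frac{1}{t-1} - \frac{1}{t} \Bigr) \sum_{s=1}^{t-1} y_s - \frac{y_t}{t} \]

so that, clearly, regardless of the sequence of outcomes, $\|\hat p_t - p_t^*\| \le 2/t$. We obtain 

\[ \ell(\hat p_t, y) - \ell(p_t^*, y) \le \frac{8}{t} , \]

and therefore, by (3.1), we have 

\[ \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} \le \sum_{t=1}^{n} \frac{8}{t} \le 8 (1 + \ln n) , \]

where we used $\sum_{t=1}^{n} 1/t \le 1 + \int_1^n dx/x = 1 + \ln n$. Observe that this performance bound 
is significantly smaller than the general bounds of the order of $\sqrt{n}$ obtained in Chapter 2. 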
Also, remarkably, the “size” of the class of experts is not reflected in the upper bound. 
The class of experts in the example considered is not only infinite but can be a ball in an 
infinite-dimensional vector space! 
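
The square-loss example can be checked numerically. The sketch below is only an illustration under stated assumptions: the Hilbert space is replaced by $\mathbb{R}^3$, the outcomes are random points of the unit ball, and the arbitrary first prediction is taken to be the origin. It runs the follow-the-best-expert forecaster (the running mean of past outcomes) next to the hypothetical forecaster that also sees the current outcome, and compares the regret with $8(1 + \ln n)$.

```python
import math
import random

random.seed(1)
n, d = 500, 3   # sequence length and dimension (both chosen for illustration)


def random_point_in_unit_ball(dim):
    """Gaussian vector, rescaled onto the unit sphere if it falls outside the ball."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    scale = 1.0 / max(1.0, norm)
    return [x * scale for x in v]


def sq_dist(p, y):
    return sum((a - b) ** 2 for a, b in zip(p, y))


outcomes = [random_point_in_unit_ball(d) for _ in range(n)]

running_sum = [0.0] * d
forecaster_loss = 0.0     # follow-the-best-expert: mean of y_1, ..., y_{t-1}
hypothetical_loss = 0.0   # hypothetical forecaster: mean of y_1, ..., y_t
for t, y in enumerate(outcomes, start=1):
    p_hat = [s / (t - 1) for s in running_sum] if t > 1 else [0.0] * d
    running_sum = [s + a for s, a in zip(running_sum, y)]
    p_star = [s / t for s in running_sum]
    forecaster_loss += sq_dist(p_hat, y)
    hypothetical_loss += sq_dist(p_star, y)

# The best constant expert in hindsight is the mean of all outcomes,
# which lies in the unit ball by convexity.
best_expert = [s / n for s in running_sum]
best_loss = sum(sq_dist(best_expert, y) for y in outcomes)

regret = forecaster_loss - best_loss
bound = 8.0 * (1.0 + math.log(n))
print(f"regret = {regret:.4f}, bound 8(1 + ln n) = {bound:.4f}")
print(f"hypothetical forecaster loss = {hypothetical_loss:.4f}, best expert loss = {best_loss:.4f}")
```

On any outcome sequence in the unit ball the printed regret should stay below the bound, and the hypothetical forecaster should not lose more than the best constant expert, in line with Lemma 3.1.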


Convex Losses, Constant Experts 

In the rest of this section we generalize the example of square loss described earlier by 
deriving general sufficient conditions for convex loss functions when the class of experts 
contains constant experts. For the simplicity of exposition we limit the discussion to the 
finite-dimensional case, but it is easy to generalize the results. 

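Let $\mathcal{D}$ be a bounded convex subset of $\mathbb{R}^d$, and assume that an expert $E$ is assigned to 
each element of $\mathcal{D}$ so that the loss of expert $E$ at time $t$ is $\ell(E, y_t)$. Using our convention 
of denoting vectors by bold letters, we write $\hat{\mathbf p}_t = (\hat p_{1,t}, \dots, \hat p_{d,t})$ for the follow-the-best-expert 
forecaster, and similarly $\mathbf p_t^* = (p_{1,t}^*, \dots, p_{d,t}^*)$ for the corresponding hypothetical 
forecaster. We make the following assumptions on the loss function: 


1. $\ell$ is convex in its first argument and takes its values in [0, 1]; 

2. for each fixed $y \in \mathcal{Y}$, $\ell(\cdot, y)$ is Lipschitz in its first argument, with constant $B$; 

3. for each fixed $y \in \mathcal{Y}$, $\ell(\cdot, y)$ is twice differentiable. Moreover, there exists a constant 
$C > 0$ such that for each fixed $y \in \mathcal{Y}$, the Hessian matrix 

\[ \left( \frac{\partial^2 \ell(\mathbf p, y)}{\partial p_i \, \partial p_j} \right)_{d \times d} \]

is positive definite with eigenvalues bounded from below by $C$; 

4. for any $y_1, \dots, y_t$, the minimizer $\mathbf p_t^*$ is such that $\nabla \Psi_t(\mathbf p_t^*) = 0$, where, for each $\mathbf p \in \mathcal{D}$, 

\[ \Psi_t(\mathbf p) = \frac{1}{t} \sum_{s=1}^{t} \ell(\mathbf p, y_s) . \]


Theorem 3.1. Under the assumptions described above, the per-round regret of the follow-the-best-expert 
forecaster satisfies 

\[ \frac{1}{n} \Bigl( \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} \Bigr) \le \frac{4 B^2 (1 + \ln n)}{C n} . \]


Proof. First, by a Taylor series expansion of $\Psi_t$ around its minimum value $\mathbf p_t^*$, we have 

\[ \Psi_t(\hat{\mathbf p}_t) - \Psi_t(\mathbf p_t^*) = \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \left. \frac{\partial^2 \Psi_t}{\partial p_i \, \partial p_j} \right|_{\mathbf p} (\hat p_{i,t} - p_{i,t}^*)(\hat p_{j,t} - p_{j,t}^*) \quad \text{(for some } \mathbf p \in \mathcal{D}\text{)} \]
\[ \ge \frac{C}{2} \, \|\hat{\mathbf p}_t - \mathbf p_t^*\|^2 , \]

where we used the fact that $\nabla \Psi_t(\mathbf p_t^*) = 0$ and the assumption on the Hessian of the loss 
function. On the other hand, 

\[ \Psi_t(\hat{\mathbf p}_t) - \Psi_t(\mathbf p_t^*) \]
\[ = \bigl( \Psi_{t-1}(\hat{\mathbf p}_t) - \Psi_t(\mathbf p_t^*) \bigr) + \bigl( \Psi_t(\hat{\mathbf p}_t) - \Psi_{t-1}(\hat{\mathbf p}_t) \bigr) \]
\[ \le \bigl( \Psi_{t-1}(\mathbf p_t^*) - \Psi_t(\mathbf p_t^*) \bigr) + \bigl( \Psi_t(\hat{\mathbf p}_t) - \Psi_{t-1}(\hat{\mathbf p}_t) \bigr) \quad \text{(by the definition of } \hat{\mathbf p}_t\text{)} \]
\[ = \frac{1}{t(t-1)} \sum_{s=1}^{t-1} \bigl( \ell(\mathbf p_t^*, y_s) - \ell(\hat{\mathbf p}_t, y_s) \bigr) + \frac{1}{t} \bigl( \ell(\hat{\mathbf p}_t, y_t) - \ell(\mathbf p_t^*, y_t) \bigr) \]
\[ \le \frac{2B}{t} \, \|\hat{\mathbf p}_t - \mathbf p_t^*\| , \]

where at the last step we used the Lipschitz property of $\ell(\cdot, y)$. Comparing the upper and 
lower bounds derived for $\Psi_t(\hat{\mathbf p}_t) - \Psi_t(\mathbf p_t^*)$, we see that for every $t = 1, 2, \dots$, 

\[ \|\hat{\mathbf p}_t - \mathbf p_t^*\| \le \frac{4B}{C t} . \]

Therefore, 

\[ \hat L_n - \inf_{E \in \mathcal{E}} L_{E,n} \le \sum_{t=1}^{n} \bigl( \ell(\hat{\mathbf p}_t, y_t) - \ell(\mathbf p_t^*, y_t) \bigr) \le \sum_{t=1}^{n} B \, \|\hat{\mathbf p}_t - \mathbf p_t^*\| \le \frac{4 B^2}{C} \sum_{t=1}^{n} \frac{1}{t} \]

as desired. □ 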


Theorem 3.1 establishes general conditions for the loss function under which the regret 
grows at the slow rate of $\ln n$. Just as in the case of the square loss, the size of the 
class of experts does not appear explicitly in the upper bound. In particular, the upper 
bound is independent of the dimension. The third condition basically requires that the 
loss function have an approximately quadratic behavior around the minimum. The last 
condition is satisfied for many smooth strictly convex loss functions. For example, if 
$\mathcal{Y} = \mathcal{D}$ and $\ell(\mathbf p, y) = \|\mathbf p - y\|^{\alpha}$ for some $\alpha \in (1, 2]$, then all assumptions are easily seen 
to be satisfied. Other general conditions under which the follow-the-best-expert forecaster 
achieves logarithmic regret have been thoroughly studied. Pointers to the literature are 
given at the end of the chapter. 


3.3 Exp-concave Loss Functions 


Now we return to the scenario of Section 2.1. Thus, we consider a finite class of $N$ experts. 
In this section we introduce a class of loss functions that enjoys some useful properties when 
used in conjunction with the exponential potential $\Phi_\eta(\mathbf u) = \frac{1}{\eta} \ln \bigl( \sum_{i=1}^{N} e^{\eta u_i} \bigr)$. A loss function 
$\ell$ is exp-concave for a certain $\eta > 0$ if the function $F(z) = e^{-\eta \ell(z, y)}$ is concave for all 
$y \in \mathcal{Y}$. Exp-concavity is a stronger property than convexity of $\ell$ in the first argument (see the 
exercises). The larger the value of $\eta$, the more stringent the assumption of exp-concavity is. 
The following result shows a key property of exp-concave functions. Recall that the exponentially 
weighted average forecaster is defined by $\hat p_t = \sum_{i=1}^{N} w_{i,t-1} f_{i,t} / \sum_{j=1}^{N} w_{j,t-1}$, 
where $w_{i,t-1} = e^{-\eta L_{i,t-1}}$. 


Theorem 3.2. If the loss function $\ell$ is exp-concave for $\eta > 0$, then the regret of the exponentially 
weighted average forecaster (used with the same value of $\eta$) satisfies, for all 
$y_1, \dots, y_n \in \mathcal{Y}$, $\Phi_\eta(R_n) \le \Phi_\eta(0)$. 


Proof. It suffices to show that the value of the potential function can never increase; that 
is, $\Phi_\eta(R_t) \le \Phi_\eta(R_{t-1})$ or, equivalently, 

\[ \sum_{i=1}^{N} e^{-\eta L_{i,t-1}} e^{\eta r_{i,t}} \le \sum_{i=1}^{N} e^{-\eta L_{i,t-1}} . \]

This may be rewritten as 

\[ \exp\bigl( -\eta \ell(\hat p_t, y_t) \bigr) \ge \frac{\sum_{i=1}^{N} w_{i,t-1} \exp\bigl( -\eta \ell(f_{i,t}, y_t) \bigr)}{\sum_{j=1}^{N} w_{j,t-1}} . \qquad (3.2) \]

But this follows from the definition of $\hat p_t$, the concavity of $F(z)$, and Jensen’s inequality. □ 


The fact that a forecaster is able to guarantee $\Phi_\eta(R_n) \le \Phi_\eta(0)$ immediately implies that 
his regret is bounded by a constant independently of the sequence length $n$. 


Proposition 3.1. If, for some loss function $\ell$ and for some $\eta > 0$, a forecaster satisfies 
$\Phi_\eta(R_n) \le \Phi_\eta(0)$ for all $y_1, \dots, y_n \in \mathcal{Y}$, then the regret of the forecaster is bounded by 

\[ \hat L_n - \min_{i=1,\dots,N} L_{i,n} \le \frac{\ln N}{\eta} . \]


Proof. Using $\Phi_\eta(R_n) \le \Phi_\eta(0)$ we immediately get 

\[ \hat L_n - \min_{i=1,\dots,N} L_{i,n} = \max_{i=1,\dots,N} R_{i,n} \le \frac{1}{\eta} \ln \sum_{j=1}^{N} e^{\eta R_{j,n}} = \Phi_\eta(R_n) \le \Phi_\eta(0) = \frac{\ln N}{\eta} . \qquad \Box \]


Figure 3.1. The relative entropy loss and the square loss plotted for $\mathcal{D} = \mathcal{Y} = [0, 1]$. 


Clearly, the larger the $\eta$, the better is the bound guaranteed by Proposition 3.1. For each 
loss function $\ell$ there is a maximal value of $\eta$ for which $\ell$ is exp-concave (if such an $\eta$ exists 
at all). To optimize performance, the exponentially weighted average forecaster should be 
run using this largest value of $\eta$. 

For the exponentially weighted average forecaster, combining Theorem 3.2 with Proposition 3.1 
we get the regret bounded by $(\ln N)/\eta$ for all exp-concave losses (where $\eta$ may 
depend on the loss), a quantity which is independent of the length $n$ of the sequence of outcomes. 
Note that, apart from the assumption of exp-concavity, we do not assume anything 
else about the loss function. In particular, we do not explicitly assume that the loss function 
is bounded. The examples that follow show that some simple and important loss functions 
are exp-concave (see the exercises for further examples). 
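
A minimal sketch of the exponentially weighted average forecaster, run on an exp-concave loss, may help to make the constant regret bound concrete. The choices below are assumptions for illustration: the loss is the square loss on [0, 1] with $\eta = 1/2$ (shown in the examples of this section to be exp-concave), and the expert advice and outcomes are random. The function name `exp_weighted_average` is hypothetical.

```python
import math
import random


def square_loss(p, y):
    return (p - y) ** 2


def exp_weighted_average(expert_preds, outcomes, loss, eta):
    """Predict with weights w_{i,t-1} = exp(-eta * L_{i,t-1})."""
    num_experts = len(expert_preds[0])
    cum_loss = [0.0] * num_experts
    predictions = []
    for advice, y in zip(expert_preds, outcomes):
        # Shifting by the smallest cumulative loss avoids underflow and
        # cancels in the normalized weighted average.
        base = min(cum_loss)
        weights = [math.exp(-eta * (L - base)) for L in cum_loss]
        total = sum(weights)
        predictions.append(sum(w * f for w, f in zip(weights, advice)) / total)
        cum_loss = [L + loss(f, y) for L, f in zip(cum_loss, advice)]
    return predictions, cum_loss


random.seed(2)
n, N, eta = 1000, 10, 0.5
expert_preds = [[random.random() for _ in range(N)] for _ in range(n)]
outcomes = [random.random() for _ in range(n)]

predictions, cum_loss = exp_weighted_average(expert_preds, outcomes, square_loss, eta)
forecaster_loss = sum(square_loss(p, y) for p, y in zip(predictions, outcomes))
regret = forecaster_loss - min(cum_loss)
bound = math.log(N) / eta
print(f"regret = {regret:.4f}, (ln N)/eta = {bound:.4f}")
```

Rerunning with a longer sequence should leave the bound unchanged, since $(\ln N)/\eta$ does not depend on $n$.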


Relative Entropy Loss 

Let $\mathcal{D} = \mathcal{Y} = [0, 1]$, and consider the relative entropy loss $\ell(p, y) = y \ln(y/p) + (1 - 
y) \ln((1 - y)/(1 - p))$. With an easy calculation one can check that for $\eta = 1$ the function 
$F(z)$ is concave for all values of $y$. Note that this is an unbounded loss function. A 
special case of this loss function, for $\mathcal{Y} = \{0, 1\}$ and $\mathcal{D} = [0, 1]$, is the logarithmic loss 
$\ell(z, y) = -\mathbb{I}_{\{y=1\}} \ln z - \mathbb{I}_{\{y=0\}} \ln(1 - z)$. The logarithmic loss function plays a central role 
in several applications, and we devote a separate chapter to it (Chapter 9). 


Square Loss 

This loss function is defined by $\ell(z, y) = (z - y)^2$, where $\mathcal{D} = \mathcal{Y} = [0, 1]$. Straightforward 
calculation shows that $F(z)$ is concave if and only if, for all $z$, $(z - y)^2 \le 1/(2\eta)$. This is 
clearly guaranteed if $\eta \le 1/2$. 
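
The concavity condition for the square loss can be verified numerically. The sketch below uses the second derivative of $F(z) = e^{-\eta (z - y)^2}$, which is $F(z)\,(4\eta^2 (z - y)^2 - 2\eta)$, and checks its sign on a grid of $[0, 1]^2$. The grid resolution and the two tested values of $\eta$ are arbitrary choices for this example.

```python
import math


def second_derivative_of_F(z, y, eta):
    """Second derivative in z of F(z) = exp(-eta * (z - y)**2)."""
    F = math.exp(-eta * (z - y) ** 2)
    return F * (4.0 * eta ** 2 * (z - y) ** 2 - 2.0 * eta)


grid = [k / 100.0 for k in range(101)]   # z and y both range over [0, 1]


def is_concave_on_grid(eta):
    # F is concave in z exactly when its second derivative is nonpositive.
    return all(
        second_derivative_of_F(z, y, eta) <= 1e-12 for z in grid for y in grid
    )


concave_eta_half = is_concave_on_grid(0.5)
concave_eta_one = is_concave_on_grid(1.0)
print("eta = 1/2 concave on grid:", concave_eta_half)
print("eta = 1   concave on grid:", concave_eta_one)
```

For $\eta = 1$ the condition $(z - y)^2 \le 1/(2\eta)$ fails, for instance, at $z = 1$, $y = 0$, so the second check should report a violation.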


Absolute Loss 
Let $\mathcal{D} = \mathcal{Y} = [0, 1]$, and consider the absolute loss $\ell(z, y) = |z - y|$. It is easy to see that 
$F(z)$ is not concave for any value of $\eta$. In fact, as is shown in Section 3.7, for this loss 
function there is no forecaster whose cumulative excess loss can be bounded independently 
of $n$. 

Exp-concave losses also make it easy to prove regret bounds that hold for countably 
many experts. The only modification we need is that the exponential potential $\Phi_\eta(R) = 
\frac{1}{\eta} \ln \bigl( \sum_{i=1}^{N} e^{\eta R_i} \bigr)$ be changed to $\Phi_\eta(R) = \frac{1}{\eta} \ln \bigl( \sum_{i=1}^{\infty} q_i e^{\eta R_i} \bigr)$, where $\{q_i : i = 1, 2, \dots\}$ is 
any probability distribution over the set of positive integers. This ensures convergence of 
the series. Equivalently, $q_i$ represents the initial weight of expert $i$ (as in Exercise 2.5). 


Corollary 3.1. Assume that $(\ell, \eta)$ satisfy the assumption of Theorem 3.2. For any countable 
class of experts and for any probability distribution $\{q_i : i = 1, 2, \dots\}$ over the set of 
positive integers, the exponentially weighted average forecaster defined above satisfies, for 
all $n \ge 1$ and for all $y_1, \dots, y_n \in \mathcal{Y}$, 

\[ \hat L_n \le \inf_{i \ge 1} \Bigl( L_{i,n} + \frac{1}{\eta} \ln \frac{1}{q_i} \Bigr) . \]


Proof. The proof is almost identical to that of Theorem 3.2; we merely need to redefine 
$w_{i,t-1}$ in that proof as $q_i e^{-\eta L_{i,t-1}}$. This allows us to conclude that $\Phi_\eta(R_n) \le \Phi_\eta(0)$. Hence, 
for all $i \ge 1$, 

\[ q_i e^{\eta R_{i,n}} \le \sum_{j=1}^{\infty} q_j e^{\eta R_{j,n}} = e^{\eta \Phi_\eta(R_n)} \le e^{\eta \Phi_\eta(0)} = 1 . \]

Solving for $R_{i,n}$ yields the desired bound. □ 


Thus, the cumulative loss of the forecaster exceeds the loss of each expert by at most 
a constant, but the constant depends on the expert. If we write the exponentially weighted 
forecaster for countably many experts in the form 

\[ \hat p_t = \frac{\sum_{i=1}^{\infty} f_{i,t} \exp\Bigl( -\eta \bigl( L_{i,t-1} + \frac{1}{\eta} \ln \frac{1}{q_i} \bigr) \Bigr)}{\sum_{j=1}^{\infty} \exp\Bigl( -\eta \bigl( L_{j,t-1} + \frac{1}{\eta} \ln \frac{1}{q_j} \bigr) \Bigr)} , \]

then we see that the quantity $\frac{1}{\eta} \ln(1/q_i)$ may be regarded as a “penalty” we add to the 
cumulative loss of expert $i$ at each time $t$. Corollary 3.1 is a so-called “oracle inequality,” 
which states that the mixture forecaster achieves a cumulative loss matching the best 
penalized cumulative loss of the experts (see also Section 3.5). 


A Mixture Forecaster for Exp-concave Losses 

We close this section by showing how the exponentially weighted average predictor can be 
extended naturally to handle certain uncountable classes of experts. The class we consider is 
given by the convex hull of a finite number of “base” experts. Thus, the goal of the forecaster 
is to predict as well as the best convex combination of the base experts. The formal model 
is described as follows: consider $N$ (base) experts whose predictions, at time $t$, are given by 
$f_{i,t} \in \mathcal{D}$, $i = 1, \dots, N$, $t = 1, \dots, n$. We denote by $\mathbf f_t$ the vector $(f_{1,t}, \dots, f_{N,t})$ of expert 
advice at time $t$. The decision space $\mathcal{D}$ is assumed to be a convex set, and we assume that the 
loss function $\ell$ is exp-concave for a certain value $\eta > 0$. Define the regret of a forecaster, 
with respect to the convex hull of the $N$ base experts, by 

\[ \hat L_n - \inf_{q \in \Delta} L_{q,n} = \sum_{t=1}^{n} \ell(\hat p_t, y_t) - \inf_{q \in \Delta} \sum_{t=1}^{n} \ell(q \cdot \mathbf f_t, y_t) , \]

where $\hat L_n$ is the cumulative loss of the forecaster, $\Delta$ denotes the simplex of $N$-vectors 
$q = (q_1, \dots, q_N)$, with $q_i \ge 0$, $\sum_{i=1}^{N} q_i = 1$, and $q \cdot \mathbf f_t$ denotes the element of $\mathcal{D}$ given by 
the convex combination $\sum_{i=1}^{N} q_i f_{i,t}$. Finally, $L_{q,n}$ denotes the cumulative loss of the expert 
associated with $q$. 

Next we analyze the regret (in the sense defined earlier) of the exponentially weighted 
average forecaster defined by the “mixture” 

\[ \hat p_t = \frac{\int_{\Delta} w_{q,t-1} \, q \cdot \mathbf f_t \, dq}{\int_{\Delta} w_{q,t-1} \, dq} , \]

where for each $q \in \Delta$, $w_{q,t-1} = \exp\bigl( -\eta \sum_{s=1}^{t-1} \ell(q \cdot \mathbf f_s, y_s) \bigr)$. Just as before, the value of $\eta$ 
used in the definition of the weights is such that the loss function is exp-concave. Thus, the 
forecaster calculates a weighted average over the whole simplex $\Delta$, exponentially weighted 
by the past performance corresponding to each vector $q$ of convex coefficients. At this 
point we are not concerned with computational issues, but in Section 9 we will see that the 
forecaster can be computed easily in certain special cases. The next theorem shows that the 
regret is bounded by a quantity of the order of $N \ln(n/N)$. This is not always optimal; just 
consider the case of the square loss and constant experts studied in Section 3.2 for which we 
derived a bound independent of $N$. However, the bound shown here is much more general 
and can be seen to be tight in some cases; for example, for the logarithmic loss studied 
in Chapter 9. To keep the argument simple, we assume that the loss function is bounded, 
though this condition can be relaxed in some cases. 


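Theorem 3.3. Assume that the loss function $\ell$ is exp-concave for $\eta$ and it takes values in 
[0, 1]. Then the exponentially weighted mixture forecaster defined above satisfies 

\[ \hat L_n - \inf_{q \in \Delta} L_{q,n} \le \frac{N}{\eta} \ln \frac{e \eta n}{N} . \]


Proof. Defining, for all $q$, the regret 

\[ R_{q,n} = \hat L_n - L_{q,n} , \]

we may write $R_n$ for the function $q \mapsto R_{q,n}$. Introducing the potential 

\[ \Phi_\eta(R_n) = \int_{\Delta} e^{\eta R_{q,n}} \, dq \]

we see, by mimicking the proof of Theorem 3.2, that $\Phi_\eta(R_n) \le \Phi_\eta(0) = 1/(N!)$. It remains 
to relate the excess cumulative loss to the value $\Phi_\eta(R_n)$ of the potential function. 
Denote by $q^*$ the vector in $\Delta$ for which 

\[ L_{q^*,n} = \inf_{q \in \Delta} L_{q,n} . \]

Since the loss function is convex in its first argument (see Exercise 3.4), for any $q' \in \Delta$ and 
$\lambda \in (0, 1)$, 

\[ L_{(1-\lambda) q^* + \lambda q', n} \le (1 - \lambda) L_{q^*,n} + \lambda L_{q',n} \le (1 - \lambda) L_{q^*,n} + \lambda n , \]

where we used the boundedness of the loss function. Therefore, for any fixed $\lambda \in (0, 1)$, 

\[ \Phi_\eta(R_n) = \int_{\Delta} e^{\eta R_{q,n}} \, dq \]
\[ \ge e^{\eta \hat L_n} \int_{\{q \,:\, q = (1-\lambda) q^* + \lambda q', \; q' \in \Delta\}} e^{-\eta L_{q,n}} \, dq \]
\[ \ge e^{\eta \hat L_n} \int_{\{q \,:\, q = (1-\lambda) q^* + \lambda q', \; q' \in \Delta\}} e^{-\eta ((1-\lambda) L_{q^*,n} + \lambda n)} \, dq \quad \text{(by the above inequality)} \]
\[ = e^{\eta \hat L_n} e^{-\eta ((1-\lambda) L_{q^*,n} + \lambda n)} \int_{\{q \,:\, q = (1-\lambda) q^* + \lambda q', \; q' \in \Delta\}} dq . \]

The integral on the right-hand side is the volume of the simplex, scaled by $\lambda$ and centered at 
$q^*$. Clearly, this equals $\lambda^N$ times the volume of $\Delta$, that is, $\lambda^N/(N!)$. Using $\Phi_\eta(R_n) \le 1/(N!)$, 
and rearranging the obtained inequality, we get 

\[ \hat L_n - \inf_{q \in \Delta} L_{q,n} \le \hat L_n - (1 - \lambda) L_{q^*,n} \le \frac{1}{\eta} \ln \lambda^{-N} + \lambda n . \]

The minimal value on the right-hand side is achieved by $\lambda = N/(\eta n)$, which yields the bound 
of the theorem. □ 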
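
For completeness, the last optimization step can be written out. The following is a short worked calculation, assuming $\eta n > N$ so that the minimizing $\lambda$ lies in $(0, 1)$; the function name $g$ is introduced here only for the calculation.

```latex
\[
g(\lambda) = \frac{1}{\eta}\ln \lambda^{-N} + \lambda n
           = \frac{N}{\eta}\ln\frac{1}{\lambda} + \lambda n ,
\qquad
g'(\lambda) = -\frac{N}{\eta\lambda} + n = 0
\iff \lambda = \frac{N}{\eta n} ,
\]
\[
g\!\left(\frac{N}{\eta n}\right)
  = \frac{N}{\eta}\ln\frac{\eta n}{N} + \frac{N}{\eta}
  = \frac{N}{\eta}\ln\frac{e\,\eta n}{N} .
\]
```

Since $g$ is convex in $\lambda$, the stationary point is indeed the minimum.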


3.4 The Greedy Forecaster 


In several arguments we have seen so far for analyzing the performance of weighted average 
forecasters, the key of the proof is bounding the increase of the value of the potential function 
$\Phi(R_t)$ on the regret at time $t$ with respect to the previous value $\Phi(R_{t-1})$. In Theorem 2.1 we 
do this using the Blackwell condition. In some cases special properties of the loss function 
may be used to derive sharper bounds. For example, in Section 3.3 it is shown that for some 
loss functions the exponential potential in fact decreases at each step (see Theorem 3.2). 
Thus, one may be tempted to construct prediction strategies which, at each time instance, 
minimize the worst-case increase of the potential. The purpose of this section is to explore 
this possibility. 

The first idea one might have is to construct a forecaster $\hat p$ that, at each time instant $t$, 
predicts to minimize the worst-case regret, that is, 

\[ \hat p_t = \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y_t \in \mathcal{Y}} \max_{i=1,\dots,N} R_{i,t} = \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y_t \in \mathcal{Y}} \max_{i=1,\dots,N} \bigl( R_{i,t-1} + \ell(p, y_t) - \ell(f_{i,t}, y_t) \bigr) . \]

It is easy to see that the minimum in $\hat p_t$ exists whenever the loss function $\ell$ is bounded and 
convex in its first argument. However, the minimum may not be unique. In such cases we 
may choose a minimizer by any pre-specified rule. Unfortunately, this strategy, known as 
fictitious play in the context of playing repeated games (see Chapter 7), fails to guarantee 
a vanishing per-round regret (see the exercises). 

A next attempt may be to minimize, instead of $\max_{i \le N} R_{i,t}$, a “smooth” version of the 
same such as the exponential potential 

\[ \Phi_\eta(R_t) = \frac{1}{\eta} \ln \Bigl( \sum_{i=1}^{N} e^{\eta R_{i,t}} \Bigr) . \]

Note that if $\eta \max_{i \le N} R_{i,t}$ is large, then $\Phi_\eta(R_t) \approx \max_{i \le N} R_{i,t}$; but $\Phi_\eta$ is now a smooth 
function of its components. The quantity $\eta$ is a kind of smoothing parameter. For large 
values of $\eta$ the approximation is tighter, though the level curves of $\Phi_\eta$ become less smooth 
(see Figure 2.3). 
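
The role of the exponential potential as a smooth surrogate for the maximum can be seen in a small numerical sketch. The regret vector `R` below is made up for illustration; the code evaluates the potential for a few values of the smoothing parameter and prints its gap to $\max_i R_i$, which always lies between 0 and $(\ln N)/\eta$.

```python
import math


def exponential_potential(R, eta):
    """(1/eta) * ln(sum_i exp(eta * R_i)), computed in a numerically stable way."""
    m = max(R)
    return m + math.log(sum(math.exp(eta * (r - m)) for r in R)) / eta


R = [3.0, 1.0, -2.0, 2.5]   # hypothetical regret vector
gaps = {}
for eta in (0.1, 1.0, 10.0):
    gaps[eta] = exponential_potential(R, eta) - max(R)
    print(f"eta = {eta:5.1f}   gap to max = {gaps[eta]:.6f}   (ln N)/eta = {math.log(len(R)) / eta:.6f}")
```

Larger values of the parameter shrink the gap, at the price of a less smooth function.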
On the basis of the potential function $\Phi$, we may now introduce the forecaster $\hat p$ that, at 
every time instant, greedily minimizes the largest possible increase of the potential function 
for all possible outcomes $y_t$. That is, 

\[ \hat p_t = \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y_t \in \mathcal{Y}} \Phi(R_{t-1} + r_t) \qquad \text{(the greedy forecaster).} \]

Recall that the $i$th component of the regret vector $r_t$ is $\ell(p, y_t) - \ell(f_{i,t}, y_t)$. Note that, for 
the exponential potential, the previous condition is equivalent to 

\[ \hat p_t = \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y_t \in \mathcal{Y}} \Bigl( \ell(p, y_t) + \frac{1}{\eta} \ln \sum_{i=1}^{N} e^{-\eta L_{i,t}} \Bigr) . \]

In what follows, we show that the greedy forecaster is well defined, and, in fact, has 
the same performance guarantees as the weighted average forecaster based on the same 
potential function. 

Assume that the potential function $\Phi$ is convex. Then, because the supremum of convex 
functions is convex, $\sup_{y_t \in \mathcal{Y}} \Phi(R_{t-1} + r_t)$ is a convex function of $p$ if $\ell$ is convex in its 
first argument. Thus, the minimum over $p$ exists, though it may not be unique. Once again, 
a minimizer may be chosen by any pre-specified rule. To compute the predictions $\hat p_t$, a 
convex function has to be minimized at each step. This is computationally feasible in many 
cases, though it is, in general, not as simple as computing, for example, the predictions of 
a weighted average forecaster. In some cases, however, the predictions may be given in a 
closed form. We provide some examples. 

The following obvious result helps analyze greedy forecasters. Better bounds for certain 
losses are derived in Section 3.5 (see Proposition 3.3). 


Theorem 3.4. Let $\Phi : \mathbb{R}^N \to \mathbb{R}$ be a nonnegative, twice differentiable convex function. 
Assume that there exists a forecaster whose regret vector satisfies 

\[ \Phi(R'_t) \le \Phi(R'_{t-1}) + c_t \]

for any sequence $r'_1, \dots, r'_n$ of regret vectors and for any $t = 1, \dots, n$, where $c_t$ is a constant 
depending on $t$ only. Then the regret $R_n$ of the greedy forecaster satisfies 

\[ \Phi(R_n) \le \Phi(0) + \sum_{t=1}^{n} c_t . \]


Proof. It suffices to show that, for every $t = 1, \dots, n$, 

\[ \Phi(R_t) \le \Phi(R_{t-1}) + c_t . \]

By the definition of the greedy forecaster, this is equivalent to saying that there exists a 
$\hat p_t \in \mathcal{D}$ such that 

\[ \sup_{y_t \in \mathcal{Y}} \Phi(R_{t-1} + r_t) \le \Phi(R_{t-1}) + c_t , \]

where $r_t$ is the vector with components $\ell(\hat p_t, y_t) - \ell(f_{i,t}, y_t)$, $i = 1, \dots, N$. The existence 
of such a $\hat p_t$ is guaranteed by assumption. □ 


The weighted average forecasters analyzed in Sections 2.1 and 3.3 all satisfy the condition 
of Theorem 3.4, and therefore the corresponding greedy forecasters inherit their 
properties proven in Theorems 2.1 and 3.2. Some simple examples of greedy forecasters 
using the exponential potential $\Phi_\eta$ follow. 


Absolute Loss 
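Consider the absolute loss $\ell(p, y) = |p - y|$ in the simple case when $\mathcal{Y} = \{0, 1\}$ and 
$\mathcal{D} = [0, 1]$. Consider the greedy forecaster based on the exponential potential $\Phi_\eta$. Since 
$y_t$ is binary valued, determining $\hat p_t$ amounts to minimizing the maximum of two convex 
functions. After trivial simplifications we obtain 

\[ \hat p_t = \operatorname*{argmin}_{p \in [0,1]} \max \Bigl\{ \sum_{i=1}^{N} e^{\eta (\ell(p, 0) - \ell(f_{i,t}, 0) - L_{i,t-1})} , \; \sum_{i=1}^{N} e^{\eta (\ell(p, 1) - \ell(f_{i,t}, 1) - L_{i,t-1})} \Bigr\} . \]

(Recall that $L_{i,t} = \ell(f_{i,1}, y_1) + \cdots + \ell(f_{i,t}, y_t)$ denotes the cumulative loss of expert $i$ at 
time $t$.) To determine the minimum, just observe that the maximum of two convex functions 
achieves its minimum either at a point where the two functions are equal or at the minimum 
of one of the two functions. Thus, $\hat p_t$ either equals 0 or 1, or 

\[ \frac{1}{2} + \frac{1}{2\eta} \ln \frac{\sum_{i=1}^{N} e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, 1)}}{\sum_{j=1}^{N} e^{-\eta L_{j,t-1} - \eta \ell(f_{j,t}, 0)}} \]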


depending on which of the three values gives a smaller worst-case value of the potential 
function. Now it follows by Theorems 3.4 and 2.2 that the cumulative loss of this greedy 
forecaster is bounded as 


Square Loss 

Consider next the setup of the previous example with the only difference that the loss 
function now is $\ell(p, y) = (p - y)^2$. The calculations may be repeated the same way, and, 
interestingly, it turns out that the greedy forecaster $\hat p_t$ has exactly the same form as before; 
that is, it equals either 0 or 1, or 

\[ \frac{1}{2} + \frac{1}{2\eta} \ln \frac{\sum_{i=1}^{N} e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, 1)}}{\sum_{j=1}^{N} e^{-\eta L_{j,t-1} - \eta \ell(f_{j,t}, 0)}} \]

depending on which of the three values gives a smaller worst-case value of the potential 
function. In Theorem 3.2 it is shown that special properties of the square loss imply that 
if the exponentially weighted average forecaster is used with $\eta = 1/2$, then the potential 
function cannot increase in any step. Theorem 3.2, combined with the previous result, shows 
that if $\eta = 1/2$, then the greedy forecaster satisfies 

\[ \hat L_n - \min_{i=1,\dots,N} L_{i,n} \le 2 \ln N . \]


Logarithmic Loss 

Again let $\mathcal{Y} = \{0, 1\}$ and $\mathcal{D} = [0, 1]$, and consider the logarithmic loss $\ell(\hat p, y) = 
-\mathbb{I}_{\{y=1\}} \ln \hat p - \mathbb{I}_{\{y=0\}} \ln(1 - \hat p)$ and the exponential potential. It is interesting that in this 
case, for $\eta = 1$, the greedy forecaster coincides with the exponentially weighted average 
forecaster (see the exercises). 
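
To make the closed-form examples of this section concrete, here is a minimal sketch of the greedy forecaster with the exponential potential for the absolute loss and binary outcomes. For $y \in \{0, 1\}$ the quantity to minimize is the larger of $p + a_0$ and $1 - p + a_1$, where $a_y = \frac{1}{\eta} \ln \sum_i e^{-\eta (L_{i,t-1} + |f_{i,t} - y|)}$, so the minimizer is the crossing point $\frac{1}{2} + \frac{1}{2}(a_1 - a_0)$ clipped to [0, 1]. The function names, the random data, and the tuning $\eta = \sqrt{8 (\ln N)/n}$ are assumptions made for this example.

```python
import math
import random


def log_sum_exp(xs):
    """Numerically stable ln(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def greedy_absolute_loss(expert_preds, outcomes, eta):
    """Greedy forecaster for |p - y| with y in {0, 1} and advice in [0, 1]."""
    num_experts = len(expert_preds[0])
    cum_loss = [0.0] * num_experts
    predictions = []
    for advice, y in zip(expert_preds, outcomes):
        # a_y: smoothed best cumulative loss if the next outcome turned out to be y.
        a0 = log_sum_exp([-eta * (L + abs(f - 0.0)) for L, f in zip(cum_loss, advice)]) / eta
        a1 = log_sum_exp([-eta * (L + abs(f - 1.0)) for L, f in zip(cum_loss, advice)]) / eta
        # p + a0 increases and 1 - p + a1 decreases in p, so the maximum of the
        # two is minimized where they cross, clipped to the decision space [0, 1].
        crossing = 0.5 + 0.5 * (a1 - a0)
        predictions.append(min(1.0, max(0.0, crossing)))
        cum_loss = [L + abs(f - y) for L, f in zip(cum_loss, advice)]
    return predictions, cum_loss


random.seed(3)
n, N = 400, 8
eta = math.sqrt(8.0 * math.log(N) / n)   # assumed tuning, not prescribed by this section
expert_preds = [[random.random() for _ in range(N)] for _ in range(n)]
outcomes = [random.randint(0, 1) for _ in range(n)]

predictions, cum_loss = greedy_absolute_loss(expert_preds, outcomes, eta)
regret = sum(abs(p - y) for p, y in zip(predictions, outcomes)) - min(cum_loss)
print(f"regret of the greedy forecaster: {regret:.4f}")
```

Only two log-sum-exp evaluations are needed per round, which is why this special case admits a closed form while the general greedy forecaster requires a convex minimization.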


3.5 The Aggregating Forecaster 


The analysis based on exp-concavity from Section 3.3 hinges on finding, for a given loss, 
some 7 > 0 such that the exponential potential ®,(R,,) remains bounded by the initial 
potential ®,(0) when the predictions are computed by the exponentially weighted average 
forecaster. If such an ņ is found, then Proposition 3.1 entails that the regret remains 
uniformly bounded by (In N)/ņ. However, as we show in this section, for all losses for 
which such an 7 exists, one might obtain an even better regret bound by finding a forecaster 
(not necessarily based on weighted averages) guaranteeing ®,(R,) < ®,(0) for a larger 
value of 7 than the largest ņ that the weighted average forecaster can afford. Thus, we 
look for a forecaster whose predictions p, satisfy ®,(R:-1 +r) < ®,(R;-1) irrespective 
of the choice of the next outcome y, € Y (recall that, as usual, r; = (71,7, ...,7y.2), where 
ris = CPi, Y) — £C fit ¥1))- It is easy to see that this is equivalent to the condition 


N 
Li, Ye) < E In (> into) for all y, € V. 
i=1 
The distribution $q_{1,t-1}, \dots, q_{N,t-1}$ is defined via the weights associated with the exponential
potential. That is, $q_{i,t-1} = e^{-\eta L_{i,t-1}} / \bigl(\sum_{j=1}^N e^{-\eta L_{j,t-1}}\bigr)$. To allow the analysis of losses for
which no forecaster is able to prevent the exponential potential from ever increasing,
we somewhat relax the previous condition by replacing the factor $1/\eta$ with $\mu(\eta)/\eta$. The
real-valued function $\mu$ is called the mixability curve for the loss $\ell$, and it is formally
defined as follows. For all $\eta > 0$, $\mu(\eta)$ is the infimum of all numbers $c$ such that for all
$N$, for all probability distributions $(q_1, \dots, q_N)$, and for all choices of the expert advice
$f_1, \dots, f_N \in \mathcal{D}$, there exists a $\widehat{p} \in \mathcal{D}$ such that

\[
\ell(\widehat{p}, y) \le -\frac{c}{\eta} \ln\Bigl(\sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i\Bigr)
\qquad \text{for all } y \in \mathcal{Y}. \tag{3.3}
\]


Using the terminology introduced by Vovk, we call aggregating forecaster any forecaster
that, when run with input parameter $\eta$, predicts using $\widehat{p}$, which satisfies (3.3), with $c = \mu(\eta)$.

The mixability curve can be used to obtain a bound on the loss of the aggregating
forecaster for all values $\eta > 0$.


Proposition 3.2. Let $\mu$ be the mixability curve for an arbitrary loss function $\ell$. Then, for
all $\eta > 0$, the aggregating forecaster achieves

\[
\widehat{L}_n \le \mu(\eta) \min_{i=1,\dots,N} L_{i,n} + \frac{\mu(\eta)}{\eta} \ln N
\]

for all $n \ge 1$ and for all $y_1, \dots, y_n \in \mathcal{Y}$.
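Proof. Let $W_t = \sum_{i=1}^N e^{-\eta L_{i,t}}$. Then, by definition of mixability, for each $t$ there exists a
$\widehat{p}_t \in \mathcal{D}$ such that

\[
\ell(\widehat{p}_t, y_t)
\le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\frac{\sum_{i=1}^N e^{-\eta L_{i,t-1}} e^{-\eta \ell(f_{i,t}, y_t)}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}}\Bigr)
= -\frac{\mu(\eta)}{\eta} \ln \frac{W_t}{W_{t-1}} .
\]

Summing the leftmost and rightmost members of this inequality for $t = 1, \dots, n$ we get

\[
\widehat{L}_n
\le -\frac{\mu(\eta)}{\eta} \sum_{t=1}^n \ln \frac{W_t}{W_{t-1}}
= -\frac{\mu(\eta)}{\eta} \ln \frac{W_n}{W_0}
= -\frac{\mu(\eta)}{\eta} \ln\Bigl(\frac{\sum_{j=1}^N e^{-\eta L_{j,n}}}{N}\Bigr)
\le \frac{\mu(\eta)}{\eta} \bigl(\eta L_{i,n} + \ln N\bigr)
\]

for any expert $i = 1, \dots, N$. ∎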


In the introduction to Chapter 2 we described the problem of prediction with expert 
advice as an iterated game between the forecaster and the environment. To play this game, 
the environment must use some strategy for choosing expert advice and outcome at each 
round based on the forecaster’s past predictions. Fix such a strategy for the environment 
and fix a strategy for the forecaster (e.g., the weighted average forecaster). For a pair
$(a, b) \in \mathbb{R}_+^2$, say that the forecaster wins if his strategy achieves

\[
\widehat{L}_n \le a \min_{i=1,\dots,N} L_{i,n} + b \ln N
\]

for all $n, N \ge 1$; otherwise the environment wins (suppose that the number $N$ of experts is
chosen by the environment at the beginning of the game).

This game was introduced and analyzed through the notion of mixability curve in the
pioneering work of Vovk [298]. Under mild conditions on $\mathcal{D}$, $\mathcal{Y}$, and $\ell$, Vovk shows the
following result:

• For each pair $(a, b) \in \mathbb{R}_+^2$ the game is determined. That is, either there exists a forecasting
  strategy that wins irrespective of the strategy used by the environment or the environment
  has a strategy that defeats any forecasting strategy.

• The forecaster wins exactly for those pairs $(a, b) \in \mathbb{R}_+^2$ such that there exists some $\eta > 0$
  satisfying $\mu(\eta) \le a$ and $\mu(\eta)/\eta \le b$.


Vovk’s result shows that the mixability curve is exactly the boundary of the set of all pairs
$(a, b)$ such that the forecaster can always guarantee that $\widehat{L}_n \le a \min_{i \le N} L_{i,n} + b \ln N$. It
can be shown, under mild assumptions on $\ell$, that $\mu \ge 1$. The largest value of $\eta$ for which
$\mu(\eta) = 1$ is especially relevant to our regret minimization goal. Indeed, for this $\eta$ we can
get the strong regret bounds of the form $(\ln N)/\eta$.

We call $\eta$-mixable any loss function for which there exists an $\eta$ satisfying $\mu(\eta) = 1$ in
the special case $\mathcal{D} = [0, 1]$ and $\mathcal{Y} = \{0, 1\}$ (the choice of 0 and 1 is arbitrary; the theorem
can be proven for any two real numbers $a, b$, with $a < b$). The next result characterizes
mixable losses.


Theorem 3.5 (Mixability Theorem). Let $\mathcal{D} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$, and choose a loss function
$\ell$. Consider the set $S \subset [0, 1]^2$ of all pairs $(x, y)$ such that there exists some $p \in [0, 1]$
satisfying $\ell(p, 0) \le x$ and $\ell(p, 1) \le y$. For each $\eta > 0$, introduce the homeomorphism
$H_\eta : [0, 1]^2 \to [e^{-\eta}, 1]^2$ defined by $H_\eta(x, y) = (e^{-\eta x}, e^{-\eta y})$. Then $\ell$ is $\eta$-mixable if and
only if the set $H_\eta(S)$ is convex.


Proof. We need to find, for some $\eta > 0$, for all $y \in \{0, 1\}$, for all probability distributions
$q_1, \dots, q_N$, and for any expert advice $f_1, \dots, f_N \in [0, 1]$, a number $p \in [0, 1]$ satisfying

\[
\ell(p, y) \le -\frac{1}{\eta} \ln\Bigl(\sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i\Bigr).
\]

This condition can be rewritten as

\[
e^{-\eta \ell(p, y)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i
\]

and, recalling that $y \in \{0, 1\}$, as

\[
e^{-\eta \ell(p, 0)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, 0)} q_i
\qquad \text{and} \qquad
e^{-\eta \ell(p, 1)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, 1)} q_i .
\]

The condition says that a prediction $p$ must exist so that each coordinate of
$H_\eta\bigl(\ell(p, 0), \ell(p, 1)\bigr)$ is not smaller than the corresponding coordinate of the convex
combination $\sum_{i=1}^N H_\eta\bigl(\ell(f_i, 0), \ell(f_i, 1)\bigr) q_i$. If $H_\eta(S)$ is convex, then the convex combination
belongs to $H_\eta(S)$. Therefore such a $p$ always exists by definition of $S$ (see Figure 3.2). This
condition is easily seen to be also necessary. ∎
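
The two coordinate inequalities in the proof can be checked numerically on small instances. The following is a minimal sketch, not part of the original text: the function names, the grid search over $[0,1]$, and the numerical tolerance are our own illustrative assumptions. It looks for a prediction $p$ dominating the convex combination in both coordinates for the square loss with two experts.

```python
import math


def square_loss(p, y):
    return (p - y) ** 2


def find_mixing_prediction(loss, eta, advice, weights, grid_size=2001):
    """Grid-search [0, 1] for p with exp(-eta*loss(p, y)) >= sum_i q_i*exp(-eta*loss(f_i, y))
    for both y = 0 and y = 1. Returns the first such grid point, or None.
    (Hypothetical helper; a grid search is only a heuristic check.)"""
    targets = [
        sum(q * math.exp(-eta * loss(f, y)) for f, q in zip(advice, weights))
        for y in (0, 1)
    ]
    for k in range(grid_size):
        p = k / (grid_size - 1)
        # Small tolerance guards against rounding when an inequality is tight.
        if all(math.exp(-eta * loss(p, y)) >= targets[y] - 1e-12 for y in (0, 1)):
            return p
    return None


# Two experts predicting 0 and 1 with equal weights.
p_ok = find_mixing_prediction(square_loss, 2.0, [0.0, 1.0], [0.5, 0.5])
p_fail = find_mixing_prediction(square_loss, 3.0, [0.0, 1.0], [0.5, 0.5])
```

With these inputs a suitable $p$ near $1/2$ is found for $\eta = 2$, while for $\eta = 3$ no point of the grid works, in line with the square loss being 2-mixable but not 3-mixable.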


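Remark 3.1. Assume that $\ell_0(p) = \ell(p, 0)$ and $\ell_1(p) = \ell(p, 1)$ are twice differentiable so
that $\ell_0(0) = \ell_1(1) = 0$, $\ell_0'(p) > 0$, $\ell_1'(p) < 0$ for all $0 < p < 1$. Write
$(x_\eta(p), y_\eta(p)) = H_\eta\bigl(\ell_0(p), \ell_1(p)\bigr)$. Then, for each $\eta > 0$, there
exists a twice differentiable function $h_\eta$ such that $y_\eta(p) = h_\eta(x_\eta(p))$ and, by Theorem 3.5,
$\ell$ is $\eta$-mixable if and only if $h_\eta$ is concave. Using again the assumptions on $\ell$, the concavity
of $h_\eta$ is equivalent to (see Exercise 3.11)

\[
\eta \le \inf_{0 < p < 1}
\frac{\ell_0'(p)\,\ell_1''(p) - \ell_1'(p)\,\ell_0''(p)}
     {\ell_0'(p)\,\ell_1'(p)\,\bigl(\ell_1'(p) - \ell_0'(p)\bigr)} .
\]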
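
As an illustration of the condition in Remark 3.1, the infimum can be approximated on a grid when the derivatives of $\ell_0$ and $\ell_1$ are available in closed form. This is a rough sketch under our own assumptions: the helper name `mixability_bound` and the uniform grid are hypothetical, and the derivatives are entered by hand for the square loss and the logarithmic loss.

```python
def mixability_bound(d0, dd0, d1, dd1, grid_size=9999):
    """Approximate the infimum over 0 < p < 1 of
    (l0' l1'' - l1' l0'') / (l0' l1' (l1' - l0')) on a uniform grid.
    d0, dd0: first and second derivative of l0(p) = loss(p, 0);
    d1, dd1: first and second derivative of l1(p) = loss(p, 1)."""
    best = float("inf")
    for k in range(1, grid_size + 1):
        p = k / (grid_size + 1)
        num = d0(p) * dd1(p) - d1(p) * dd0(p)
        den = d0(p) * d1(p) * (d1(p) - d0(p))
        best = min(best, num / den)
    return best


# Square loss: l0(p) = p^2, l1(p) = (1 - p)^2.
eta_square = mixability_bound(
    lambda p: 2 * p, lambda p: 2.0,
    lambda p: -2 * (1 - p), lambda p: 2.0,
)

# Logarithmic loss: l0(p) = -ln(1 - p), l1(p) = -ln(p).
eta_log = mixability_bound(
    lambda p: 1 / (1 - p), lambda p: 1 / (1 - p) ** 2,
    lambda p: -1 / p, lambda p: 1 / p ** 2,
)
```

For the square loss the ratio equals $1/(2p(1-p))$, minimized at $p = 1/2$, and for the logarithmic loss it is identically 1, so the sketch should return values close to 2 and 1.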


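Remark 3.2. We can extend the mixability theorem to provide a condition sufficient for
mixability in the cases where $\mathcal{D} = \mathcal{Y} = [0, 1]$. To do this, we must verify that
$e^{-\eta \ell(p, 0)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, 0)} q_i$ and $e^{-\eta \ell(p, 1)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, 1)} q_i$ together imply that
$e^{-\eta \ell(p, y)} \ge \sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i$ for all $0 \le y \le 1$. This implication is satisfied whenever

\[
e^{-\eta \ell(p, y)} - \sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i \tag{3.4}
\]

is a concave function of $y \in [0, 1]$ for each fixed $p \in [0, 1]$, $f_i$, and $q_i$, $i = 1, \dots, N$.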
Figure 3.2. Mixability for the square loss. (a) Region $S$ of points $(x, y)$ satisfying
$(p - 0)^2 \le x$ and $(p - 1)^2 \le y$ for some $p \in [0, 1]$. (b) Prediction $p$ must correspond to a point
$H_\eta\bigl(\ell(p, 0), \ell(p, 1)\bigr)$ to the north-east of $f = \sum_{i=1}^N H_\eta\bigl(\ell(f_i, 0), \ell(f_i, 1)\bigr) q_i$. Note that $f$ is in
the convex hull (the shaded region) whose vertices are in $H_\eta(S)$. Therefore, if $H_\eta(S)$ is convex, then
$f \in H_\eta(S)$, and finding such a $p$ is always possible by definition of $S$.


The next result proves a simple relationship between the aggregating forecaster and the 
greedy forecaster of Section 3.4. 


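Proposition 3.3. For any $\eta$-mixable loss function $\ell$, the greedy forecaster using the
exponential potential $\Phi_\eta$ is an aggregating forecaster.


Proof. If a loss $\ell$ is mixable, then for each $t$ and for all $y_t \in \mathcal{Y}$ there exists $\widehat{p}_t \in \mathcal{D}$
satisfying

\[
\ell(\widehat{p}_t, y_t) \le -\frac{1}{\eta} \ln\Bigl(
\frac{\sum_{i=1}^N e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, y_t)}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}}\Bigr).
\]

Such a $\widehat{p}_t$ may then be defined by

\begin{align*}
\widehat{p}_t
&= \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y \in \mathcal{Y}} \Bigl( \ell(p, y) + \frac{1}{\eta} \ln
   \frac{\sum_{i=1}^N e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, y)}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}} \Bigr) \\
&= \operatorname*{argmin}_{p \in \mathcal{D}} \sup_{y \in \mathcal{Y}} \Bigl( \ell(p, y) + \frac{1}{\eta} \ln
   \sum_{i=1}^N e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, y)} \Bigr),
\end{align*}

which is exactly the definition of the greedy forecaster using the exponential potential. ∎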


Oracle Inequality for Mixable Losses 

The aggregating forecaster may be extended, in a simple way, to handle countably infinite
classes of experts. Consider a sequence $f_1, f_2, \dots$ of experts such that, at time $t$, the
prediction of expert $f_i$ is $f_{i,t} \in \mathcal{D}$. The goal of the forecaster is to predict as well as any of
the experts $f_i$. In order to do this, we assign, to each expert, a positive number $\pi_i > 0$ such
that $\sum_{i=1}^\infty \pi_i = 1$. The numbers $\pi_i$ may be called prior probabilities.
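Assume now that the decision space $\mathcal{D}$ is a compact metric space and the loss function $\ell$
is continuous. Let $\mu$ be the mixability curve of $\ell$. By the definition of $\mu$, for every $y \in \mathcal{Y}$,
every sequence $q_1, q_2, \dots$ of positive numbers with $\sum_i q_i = 1$, and $N > 0$, there exists a
$p^{(N)} \in \mathcal{D}$ such that

\[
\ell(p^{(N)}, y)
\le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\frac{\sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i}{\sum_{j=1}^N q_j}\Bigr)
\le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\sum_{i=1}^N e^{-\eta \ell(f_i, y)} q_i\Bigr).
\]

Because $\mathcal{D}$ is compact, the sequence $\{p^{(N)}\}$ has an accumulation point $\widehat{p} \in \mathcal{D}$. Moreover,
by the continuity of $\ell$, this accumulation point satisfies

\[
\ell(\widehat{p}, y) \le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\sum_{i=1}^\infty e^{-\eta \ell(f_i, y)} q_i\Bigr).
\]

In other words, the aggregating forecaster, defined by $\widehat{p}_t$ satisfying

\[
\ell(\widehat{p}_t, y_t) \le -\frac{\mu(\eta)}{\eta} \ln\Bigl(
\frac{\sum_{i=1}^\infty \pi_i e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, y_t)}}{\sum_{j=1}^\infty \pi_j e^{-\eta L_{j,t-1}}}\Bigr),
\]

is well defined. Then, by the same argument as in Corollary 3.1, for all $\eta > 0$, we obtain

\[
\widehat{L}_n \le \mu(\eta) \min_{i=1,2,\dots} \Bigl( L_{i,n} + \frac{1}{\eta} \ln \frac{1}{\pi_i} \Bigr)
\]

for all $n \ge 1$ and for all $y_1, \dots, y_n \in \mathcal{Y}$.
By writing the forecaster in the form

\[
\ell(\widehat{p}_t, y_t) \le -\frac{\mu(\eta)}{\eta} \ln\Bigl(
\frac{\sum_{i=1}^\infty \exp\bigl(-\eta\bigl(L_{i,t-1} + \frac{1}{\eta} \ln \frac{1}{\pi_i}\bigr) - \eta \ell(f_{i,t}, y_t)\bigr)}
     {\sum_{j=1}^\infty \exp\bigl(-\eta\bigl(L_{j,t-1} + \frac{1}{\eta} \ln \frac{1}{\pi_j}\bigr)\bigr)}\Bigr)
\]

we see that the quantity $\frac{1}{\eta} \ln(1/\pi_i)$ may be regarded as a “penalty” added to the cumulative
loss of expert $i$ at each time $t$. The performance bound for the aggregating forecaster is a
so-called “oracle inequality,” which states that it achieves a cumulative loss matching the
best penalized cumulative loss of the experts.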
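
The oracle inequality can be observed on a toy instance. The sketch below is only an illustration under assumptions of our own: it uses the logarithmic loss with binary outcomes and $\eta = 1$ (for which $\mu(\eta) = 1$ and the prior-weighted mixture satisfies the aggregating condition), a finite truncation of the expert class, and hypothetical function names.

```python
import math


def log_loss(p, y):
    return -math.log(p if y == 1 else 1.0 - p)


def mixture_log_loss(advice, outcomes, prior):
    """Run the prior-weighted mixture forecaster (eta = 1) under the log loss.
    advice[t][i] is the prediction of expert i at round t, assumed in (0, 1).
    Returns (cumulative loss of the forecaster, list of expert cumulative losses)."""
    log_w = [math.log(pi) for pi in prior]        # ln(pi_i) - L_{i,t-1}
    expert_loss = [0.0] * len(prior)
    total = 0.0
    for f_t, y in zip(advice, outcomes):
        shift = max(log_w)                        # stabilizes the exponentials
        w = [math.exp(v - shift) for v in log_w]
        p = sum(wi * fi for wi, fi in zip(w, f_t)) / sum(w)
        total += log_loss(p, y)
        for i, fi in enumerate(f_t):
            step = log_loss(fi, y)
            expert_loss[i] += step
            log_w[i] -= step
    return total, expert_loss


def oracle_bound(expert_loss, prior, eta=1.0, mu=1.0):
    """mu * min_i (L_i + ln(1/pi_i) / eta), the penalized best-expert loss."""
    return mu * min(L + math.log(1.0 / pi) / eta for L, pi in zip(expert_loss, prior))
```

Experts with a small prior $\pi_i$ pay a larger penalty $\ln(1/\pi_i)$, so the forecaster is only guaranteed to track them after they have accumulated a correspondingly large advantage in loss.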


3.6 Mixability for Certain Losses 


In this section, we examine the mixability properties of various loss functions of special 
importance. 


The Relative Entropy Loss Is 1-Mixable 
Recall that this loss is defined by

\[
\ell(p, y) = y \ln \frac{y}{p} + (1 - y) \ln \frac{1 - y}{1 - p}, \qquad \text{where } p, y \in [0, 1].
\]


Let us first consider the special case when $y \in \{0, 1\}$ (logarithmic loss). It is easy to
see that the conditions of the mixability theorem are satisfied with $\eta = 1$. Note that for
this choice of $\eta$, the unique prediction $\widehat{p}$ satisfying definition (3.3) of the aggregating
forecaster with $c = 1$ is exactly the prediction of the exponentially weighted average
forecaster. Therefore, in the case of logarithmic loss, this approach does not offer any
advantage with respect to the analysis based on exp-concavity from Section 3.3. Also
recall from Section 3.4 that for $\eta = 1$ the exponentially weighted average forecaster is
the unique greedy minimizer of the potential function. Thus, for the logarithmic loss, this
predictor has various interesting properties. Indeed, in Chapter 9 we show that when the
set of experts is finite, the exponentially weighted average forecaster is the best possible
forecaster.

Returning to the relative entropy loss, the validity of (3.3) for $c = 1$, $\eta = 1$, and
$\widehat{p} = \sum_{i=1}^N f_i q_i$ (i.e., when $\widehat{p}$ is the weighted average prediction) is obtained directly from the
exp-concavity of the function $F(z) = e^{-\eta \ell(z, y)}$, which we proved in Section 3.3. Therefore,
the exponentially weighted average forecaster with $\eta = 1$ satisfies the conditions of the
mixability theorem also for the relative entropy loss. Again, on the other hand, we do not
get any improvement with respect to the analysis based on exp-concavity.


The Square Loss Is 2-Mixable 

For the square loss $\ell(p, y) = (p - y)^2$, $p, y \in [0, 1]$, we begin again by assuming
$y \in \{0, 1\}$. Then the conditions of the mixability theorem are easily verified with $\eta = 2$. With a
bit more work, we can also verify that function (3.4) is concave in $y \in [0, 1]$ (Exercise 3.12).
Therefore, the conditions of the mixability theorem are satisfied also when $\mathcal{D} = \mathcal{Y} = [0, 1]$.
Note that, unlike in the case of the relative entropy loss, here we gain a factor of 4 in the regret
bound with respect to the analysis based on exp-concavity. This gain is real, because it can
be shown that the conditions of the mixability theorem cannot be satisfied, in general, by the
exponentially weighted average forecaster. Moreover, it can be shown that no prediction of
the form $\widehat{p} = g\bigl(\sum_{i=1}^N f_i q_i\bigr)$ satisfies (3.3) with $c = 1$ and $\eta = 2$ no matter which function
$g$ is picked (Exercise 3.13).

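We now derive a closed-form expression for the prediction $\widehat{p}_t$ of the aggregating
forecaster. Since the square loss is mixable, by Proposition 3.3 the greedy forecaster is an
aggregating forecaster. Recalling from Section 3.4 the prediction of the greedy forecaster
for the square loss, we get

\[
\widehat{p}_t =
\begin{cases}
0 & \text{if } r_t < 0 \\
r_t & \text{if } 0 \le r_t \le 1 \\
1 & \text{if } r_t > 1
\end{cases}
\]

where

\[
r_t = \frac{1}{2} + \frac{1}{2\eta} \ln
\frac{\sum_{i=1}^N e^{-\eta L_{i,t-1} - \eta \ell(f_{i,t}, 1)}}{\sum_{j=1}^N e^{-\eta L_{j,t-1} - \eta \ell(f_{j,t}, 0)}} .
\]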
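
The closed-form prediction for the square loss is easy to turn into code. The following is a minimal sketch under our own assumptions: the function names are hypothetical, the cumulative losses are shifted by their minimum purely for numerical stability (this does not change the ratio inside the logarithm), and the default $\eta = 2$ is the mixability constant discussed above.

```python
import math


def aggregating_square_prediction(cum_losses, advice, eta=2.0):
    """Prediction r_t = 1/2 + ln(A_1 / A_0) / (2*eta), clipped to [0, 1], where
    A_y = sum_i exp(-eta*L_{i,t-1} - eta*(f_{i,t} - y)^2)."""
    shift = min(cum_losses)
    a0 = sum(math.exp(-eta * (L - shift) - eta * f ** 2)
             for L, f in zip(cum_losses, advice))
    a1 = sum(math.exp(-eta * (L - shift) - eta * (f - 1.0) ** 2)
             for L, f in zip(cum_losses, advice))
    r = 0.5 + math.log(a1 / a0) / (2.0 * eta)
    return min(1.0, max(0.0, r))


def run_forecaster(advice_rounds, outcomes, eta=2.0):
    """Play the forecaster on a sequence; returns (forecaster loss, expert losses)."""
    cum = [0.0] * len(advice_rounds[0])
    total = 0.0
    for advice, y in zip(advice_rounds, outcomes):
        p = aggregating_square_prediction(cum, advice, eta)
        total += (p - y) ** 2
        cum = [L + (f - y) ** 2 for L, f in zip(cum, advice)]
    return total, cum
```

When all experts agree on a value $f$, the ratio inside the logarithm equals $e^{\eta(2f-1)}$, so the prediction reduces to $f$. On binary outcome sequences the regret of this forecaster should not exceed $(\ln N)/2$, as Proposition 3.2 with $\mu(2) = 1$ suggests.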


The Absolute Loss Is Not Mixable 

The absolute loss, defined by $\ell(p, y) = |p - y|$ for $p, y \in [0, 1]$, does not satisfy the
conditions of the mixability theorem. Hence, we cannot hope to find $\eta > 0$ such that $\mu(\eta) = 1$.
However, we can get a bound for the loss for all $\eta > 0$ by applying Proposition 3.2. To do
this, we need to find the mixability function for the absolute loss, that is, the smallest function
$\mu : \mathbb{R}_+ \to \mathbb{R}_+$ such that, for all distributions $q_1, \dots, q_N$, for all $f_1, \dots, f_N \in [0, 1]$, and
for all outcomes $y$, there exists a $\widehat{p} \in [0, 1]$ satisfying

\[
|\widehat{p} - y| \le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\sum_{i=1}^N e^{-\eta |f_i - y|} q_i\Bigr). \tag{3.5}
\]


i=l 


Here we only consider the simple case when $y \in \{0, 1\}$ (prediction of binary outcomes). In
this case, the prediction $\widehat{p}$ achieving mixability is expressible as a (nonlinear) function of
the exponentially weighted average forecaster. In the more general case, when $y \in [0, 1]$,
no prediction of the form $\widehat{p} = g\bigl(\sum_{i=1}^N f_i q_i\bigr)$ achieves mixability no matter how function $g$
is chosen (Exercise 3.14).

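Using the linear interpolation $e^{-\eta x} \le 1 - (1 - e^{-\eta}) x$, which holds for all $\eta > 0$ and
$0 \le x \le 1$, we get

\begin{align*}
-\frac{\mu(\eta)}{\eta} \ln\Bigl(\sum_{i=1}^N e^{-\eta |f_i - y|} q_i\Bigr)
&\ge -\frac{\mu(\eta)}{\eta} \ln\Bigl(\sum_{i=1}^N \bigl(1 - (1 - e^{-\eta}) |f_i - y|\bigr) q_i\Bigr) \tag{3.6} \\
&= -\frac{\mu(\eta)}{\eta} \ln\Bigl(1 - (1 - e^{-\eta}) \Bigl|\sum_{i=1}^N f_i q_i - y\Bigr|\Bigr),
\end{align*}

where in the last step we used the assumption $y \in \{0, 1\}$. Hence, using the notation
$r = \sum_{i=1}^N f_i q_i$, it is sufficient to prove that

\[
|\widehat{p} - y| \le -\frac{\mu(\eta)}{\eta} \ln\bigl(1 - (1 - e^{-\eta}) |r - y|\bigr)
\]

or equivalently, using again the assumption $y \in \{0, 1\}$,

\[
1 + \frac{\mu(\eta)}{\eta} \ln\bigl(1 - (1 - e^{-\eta})(1 - r)\bigr)
\le \widehat{p} \le
-\frac{\mu(\eta)}{\eta} \ln\bigl(1 - (1 - e^{-\eta}) r\bigr). \tag{3.7}
\]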


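By setting $r = 1/2$ in inequality (3.7), we get

\[
1 + \frac{\mu(\eta)}{\eta} \ln\Bigl(\frac{1 + e^{-\eta}}{2}\Bigr)
\le -\frac{\mu(\eta)}{\eta} \ln\Bigl(\frac{1 + e^{-\eta}}{2}\Bigr), \tag{3.8}
\]

which is satisfied only by the assignment

\[
\mu(\eta) = \frac{\eta/2}{\ln\bigl(2/(1 + e^{-\eta})\bigr)} .
\]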


Note that for this assignment, (3.8) holds with equality. Simple calculations show that the
choice of $\mu$ above satisfies (3.7) for all $0 \le r \le 1$. Note further that for $f_1, \dots, f_N \in \{0, 1\}$
the linear approximation (3.6) is tight. If in addition $r = \sum_{i=1}^N f_i q_i = 1/2$, then there
exists only one function $\mu$ such that (3.5) holds. Therefore, $\mu$ is indeed the mixability
curve for the absolute loss with binary outcomes. It can be proven (Exercise 3.15) that
this function is also the mixability curve for the more general case when $y \in [0, 1]$. In
Figure 3.3 we show, as a function of $r$, the upper and lower bounds (3.7) on the prediction $\widehat{p}$
achieving mixability and the curve $\widehat{p} = \widehat{p}(r)$ obtained by taking the average of the upper and
lower bounds. Comparing the regret bound for the mixability-achieving forecaster obtained
via Proposition 3.2 with the bound of Theorem 2.2 for the weighted average forecaster,
one sees that the former has a better dependence on $\eta$ than the latter. However, as we
show in Section 3.7 the bound of Theorem 2.2 is asymptotically tight. Hence, the benefit
of using the more complicated mixability-achieving forecaster vanishes as $n$ grows to
infinity.

Figure 3.3. The two dashed lines show, as a function of the weighted average $r$, the upper and lower
bounds (3.7) on the mixability-achieving prediction $\widehat{p}$ for the absolute loss when $\eta = 2$; the solid line
in between is a mixability-achieving prediction $\widehat{p}$ obtained by taking the average of the upper and
lower bounds. Note that, in this case, $\widehat{p}$ can be expressed as a (nonlinear) function of the weighted
average.
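
The mixability curve of the absolute loss and the bounds (3.7) can be evaluated directly. The sketch below rests on assumptions of our own: the function names are hypothetical, and the averaged prediction is additionally clipped to $[0, 1]$, since the average of the two bounds may leave the unit interval near $r = 0$ and $r = 1$ while the clipped value still lies between the bounds.

```python
import math


def mu_abs(eta):
    """Mixability curve of the absolute loss: (eta/2) / ln(2 / (1 + exp(-eta)))."""
    return (eta / 2.0) / math.log(2.0 / (1.0 + math.exp(-eta)))


def prediction_bounds(r, eta):
    """Lower and upper bounds of (3.7) as functions of the weighted average r."""
    k = mu_abs(eta) / eta
    c = 1.0 - math.exp(-eta)
    lower = 1.0 + k * math.log(1.0 - c * (1.0 - r))
    upper = -k * math.log(1.0 - c * r)
    return lower, upper


def mixing_prediction(r, eta):
    """Average of the two bounds, clipped to [0, 1] (the clipping is our addition)."""
    lower, upper = prediction_bounds(r, eta)
    return min(1.0, max(0.0, 0.5 * (lower + upper)))


def rhs_of_3_5(advice, weights, y, eta):
    """Right-hand side of (3.5) for given advice, weights, and binary outcome y."""
    mix = sum(q * math.exp(-eta * abs(f - y)) for f, q in zip(advice, weights))
    return -(mu_abs(eta) / eta) * math.log(mix)
```

For $\eta = 2$ the curve gives $\mu(2) \approx 1.77$, and the two bounds touch at $r = 1/2$, where the prediction is exactly $1/2$.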

The relationships between the various losses examined so far are summarized in 
Figure 3.4. 


3.7 General Lower Bounds 


We now address the question of the tightness of the upper bounds obtained so far in this
and the previous chapters. Our purpose here is to derive lower bounds for the worst-case
regret. More precisely, we investigate the behavior of the minimax regret $V_n^{(N)}$. Recall from
Section 2.10 that, given a loss function $\ell$, $V_n^{(N)}$ is defined as the regret of the best possible
forecasting strategy for the worst possible choice of $n$ outcomes and advice $f_{i,t}$ for $N$
experts, where $i = 1, \dots, N$ and $t = 1, \dots, n$.

The upper bound of Theorem 2.2 shows that if the loss function is bounded between
0 and 1 then $V_n^{(N)} \le \sqrt{(n/2) \ln N}$. On the other hand, the mixability theorem proven
in this chapter shows that for any mixable loss $\ell$, the significantly tighter upper bound
$V_n^{(N)} \le c_\ell \ln N$ holds, where $c_\ell$ is a parameter that depends on the specific mixable loss.
(See Remark 3.1 for an analytic characterization of this parameter.) In this section we
show that, in some sense, both of these upper bounds are tight. The next result shows
that, apart from trivial and uninteresting cases, the minimax loss is at least proportional to
the logarithm of the number $N$ of experts. The theorem provides a lower bound for any
loss function. When applied to mixable losses, this lower bound captures the logarithmic
dependence on $N$, but it does not provide a matching constant.

Figure 3.4. A Venn diagram (regions labeled Exp-concave, Mixable, Bounded, and Convex) illustrating
the hierarchy of losses examined so far. The minimax
regret for bounded and convex losses is $O(\sqrt{n \ln N})$, and this value is achieved by the weighted
average forecaster. Mixable losses, which are not always bounded, have a minimax regret of the form
$c \ln N$. This regret is not necessarily achieved by the weighted average forecaster. Exp-concave losses
are those mixable losses for which the weighted average forecaster does guarantee a regret $c \ln N$.
Such losses are properly included in the set of mixable losses in the following sense: for certain
values of $\eta$, some losses are $\eta$-mixable but not exp-concave (e.g., the square loss). The minimax
regret for bounded losses that are not necessarily convex is studied in Chapter 4 using randomized
forecasters.


Theorem 3.6. Fix any loss function $\ell$. Then $V^{(N)}_{n \lfloor \log_2 N \rfloor} \ge \lfloor \log_2 N \rfloor \, V_n^{(2)}$ for all $N \ge 2$ and
all $n \ge 1$.


Proof. Without loss of generality, assume that there are $N = 2^M$ experts for some $M \ge 1$.
For any $m \le N/2$, we say that the expert advice $f_{1,t}, \dots, f_{N,t}$ at time $t$ is $m$-coupled if
$f_{i,t} = f_{m+i,t}$ for all $i = 1, \dots, m$. Similarly, we say that the expert advice at time $t$ is
$m$-simple if $f_{1,t} = \cdots = f_{m,t}$ and $f_{m+1,t} = \cdots = f_{2m,t}$. Note that these definitions impose
no constraints on the advice of experts with index $i > 2m$. We break time in $M$ stages of
$n$ time steps each. We say that the expert advice is $m$-simple ($m$-coupled) in stage $s$ if it
is $m$-simple ($m$-coupled) on each time step in the stage. We choose the expert advice so
that

1. the advice is $2^{s-1}$-simple for each stage $s = 1, \dots, M$;
2. for each $s = 1, \dots, M - 1$, the advice is $2^s$-coupled in all time steps up to stage $s$
   included.

Note that we can obtain such an advice simply by choosing, at each stage $s = 1, \dots, M$,
an arbitrary $2^{s-1}$-simple expert advice for the first $2^s$ experts, and then copying this advice
onto the remaining experts (see Figure 3.5).
expert   s=1   s=2   s=3
  1       0     0     0
  2       1     0     0
  3       0     1     0
  4       1     1     0
  5       0     0     1
  6       1     0     1
  7       0     1     1
  8       1     1     1

Figure 3.5. An assignment of expert advice achieving the lower bound of Theorem 3.6 for $N = 8$
and $n = 1$. Note that the advice is 1-simple for $s = 1$, 2-simple for $s = 2$, and 4-simple for $s = 3$. In
addition, the advice is also 2-coupled for $s = 1$ and 4-coupled for $s = 1, 2$.
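
The construction of the expert advice can be written down in a few lines. This is a sketch under our own assumptions: one time step per stage ($n = 1$, as in Figure 3.5), two fixed advice values (0 and 1 by default) in every stage, and hypothetical function names.

```python
def build_advice(num_stages, low=0, high=1):
    """Return advice[s - 1][i - 1] for stages s = 1..M and experts i = 1..2**M.
    In stage s the first 2**s experts receive a 2**(s-1)-simple advice
    (low on the first half, high on the second half), which is then copied
    periodically onto the remaining experts."""
    n_experts = 2 ** num_stages
    advice = []
    for s in range(1, num_stages + 1):
        half = 2 ** (s - 1)
        block = [low] * half + [high] * half
        advice.append([block[i % (2 * half)] for i in range(n_experts)])
    return advice


def is_simple(row, m):
    """m-simple: the first m experts agree, and so do the next m experts."""
    return len(set(row[:m])) == 1 and len(set(row[m:2 * m])) == 1


def is_coupled(row, m):
    """m-coupled: expert i and expert m + i give the same advice for i = 1..m."""
    return all(row[i] == row[m + i] for i in range(m))


figure_3_5 = build_advice(3)
```

For three stages the rows of `figure_3_5` are the columns $s = 1, 2, 3$ of the table in Figure 3.5.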


Consider an arbitrary forecasting strategy $P$. For any fixed sequence $y_1, \dots, y_{nM}$ on
which $P$ is run, and for any pair of stages $1 \le r \le s \le M$, let

\[
R_i(y_r^s) = \sum_{t=n(r-1)+1}^{ns} \bigl( \ell(\widehat{p}_t, y_t) - \ell(f_{i,t}, y_t) \bigr),
\]

where $\widehat{p}_t$ is the prediction computed by $P$ at time $t$.

For the sake of simplicity, assume that $n = 1$. Fix some arbitrary expert $i$ and pick any
expert $j$ (possibly equal to $i$). If $i, j \le 2^{M-1}$ or $i, j > 2^{M-1}$, then
$R_i(y_1^M) = R_i(y_1^{M-1}) + R_j(y_M)$ because the advice at $s = M$ is $2^{M-1}$-simple. Otherwise, assume without loss of
generality that $i \le 2^{M-1}$ and $j > 2^{M-1}$. Since the advice at $s = 1, \dots, M - 1$ is
$2^{M-1}$-coupled, there exists $k > 2^{M-1}$ such that $R_i(y_1^{M-1}) = R_k(y_1^{M-1})$. In addition, since the
advice at $t = M$ is $2^{M-1}$-simple, $R_j(y_M) = R_k(y_M)$. We have thus shown that for any $i$ and
$j$ there always exists $k$ such that $R_k(y_1^M) = R_i(y_1^{M-1}) + R_j(y_M)$. Repeating this argument,
and using our recursive assumptions on the expert advice, we obtain

\[
R_i(y_1^M) = \sum_{s=1}^M R_{j_s}(y_s),
\]

where $j_1, \dots, j_M = j$ are arbitrary experts. This reasoning can be easily extended to the
case $n \ge 1$, obtaining

\[
R_i(y_1^M) = \sum_{s=1}^M R_{j_s}(y_s^s).
\]

Now note that the expert advice at each stage $s = 1, \dots, M$ is $2^{s-1}$-simple, implying that
we have a pool of at least two “uncommitted experts” at each time step. Hence, using the
fact that the sequence $y_1, \dots, y_{nM}$ is arbitrary and

\[
V_n^{(N)} = \inf_P \sup_{\{\mathcal{F} : |\mathcal{F}| = N\}} \sup_{y^n \in \mathcal{Y}^n} \max_{i=1,\dots,N}
\sum_{t=1}^n \bigl( \ell(\widehat{p}_t, y_t) - \ell(f_{i,t}, y_t) \bigr),
\]

where $\mathcal{F}$ are classes of static experts (see Section 2.10), we have, for each stage $s$,

\[
R_{j_s}(y_s^s) \ge V_n^{(2)}
\]

for some choices of outcomes $y_t$, expert index $j_s$, and static expert advice. To conclude the
proof note that, trivially, $V_{nM}^{(N)} \ge R_i(y_1^M)$. ∎


Using a more involved argument, it is possible to prove a parametric lower bound that
asymptotically (for both $N, n \to \infty$) matches the upper bound $c_\ell \ln N$, where $c_\ell$ is the best
constant achieved by the mixability theorem.

What can be said about $V_n^{(N)}$ for nonmixable losses? Consider, for example, the absolute
loss $\ell(p, y) = |p - y|$. By Theorem 2.2, the exponentially weighted average forecaster has
a regret bounded by $\sqrt{(n/2) \ln N}$, which implies that, for all $n$ and $N$,

\[
\frac{V_n^{(N)}}{\sqrt{(n/2) \ln N}} \le 1 .
\]

The next result shows that, in a sense, this bound cannot be improved further. It also shows
that the exponentially weighted average forecaster is asymptotically optimal.


Theorem 3.7. If $\mathcal{Y} = \{0, 1\}$, $\mathcal{D} = [0, 1]$ and $\ell$ is the absolute loss $\ell(p, y) = |p - y|$, then

\[
\sup_{n, N} \frac{V_n^{(N)}}{\sqrt{(n/2) \ln N}} \ge 1 .
\]


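Proof. Clearly, $V_n^{(N)} \ge \sup_{\mathcal{F} : |\mathcal{F}| = N} V_n(\mathcal{F})$, where we take the supremum over classes of
$N$ static experts (see Section 2.9 for the definition of a static expert). We start by lower
bounding $V_n(\mathcal{F})$ for a fixed $\mathcal{F}$. Recall that the minimax regret $V_n(\mathcal{F})$ for a fixed class of
experts is defined as

\[
V_n(\mathcal{F}) = \inf_P \sup_{y^n \in \{0,1\}^n} \sup_{f \in \mathcal{F}}
\sum_{t=1}^n \bigl( |\widehat{p}_t - y_t| - |f_t - y_t| \bigr),
\]

where the infimum is taken over all forecasting strategies $P$. Introducing i.i.d. symmetric
Bernoulli random variables $Y_1, \dots, Y_n$ (i.e., $\mathbb{P}[Y_t = 0] = \mathbb{P}[Y_t = 1] = 1/2$), one clearly
has

\begin{align*}
V_n(\mathcal{F})
&\ge \inf_P \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^n \bigl( |\widehat{p}_t - Y_t| - |f_t - Y_t| \bigr) \\
&= \inf_P \mathbb{E} \sum_{t=1}^n |\widehat{p}_t - Y_t| - \mathbb{E} \inf_{f \in \mathcal{F}} \sum_{t=1}^n |f_t - Y_t| .
\end{align*}

(In Chapter 8 we show that this actually holds with equality.) Since the sequence $Y_1, \dots, Y_n$
is completely random, for all forecasting strategies one obviously has
$\mathbb{E} \sum_{t=1}^n |\widehat{p}_t - Y_t| = n/2$. Thus,

\begin{align*}
V_n(\mathcal{F})
&\ge \frac{n}{2} - \mathbb{E} \inf_{f \in \mathcal{F}} \sum_{t=1}^n |f_t - Y_t| \\
&= \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^n \Bigl( \frac{1}{2} - |f_t - Y_t| \Bigr) \\
&= \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^n \Bigl( \frac{1}{2} - f_t \Bigr) \sigma_t ,
\end{align*}

where $\sigma_t = 1 - 2Y_t$ are i.i.d. Rademacher random variables (i.e., with
$\mathbb{P}[\sigma_t = 1] = \mathbb{P}[\sigma_t = -1] = 1/2$). We lower bound $\sup_{\mathcal{F} : |\mathcal{F}| = N} V_n(\mathcal{F})$ by taking an average over an appropriately
chosen set of expert classes containing $N$ experts. This may be done by replacing each
expert $f = (f_1, \dots, f_n)$ by a sequence of symmetric i.i.d. Bernoulli random variables.
More precisely, let $\{Z_{i,t}\}$ be an $N \times n$ array of i.i.d. Rademacher random variables with
distribution $\mathbb{P}[Z_{i,t} = -1] = \mathbb{P}[Z_{i,t} = 1] = 1/2$. Then

\begin{align*}
\sup_{\mathcal{F} : |\mathcal{F}| = N} V_n(\mathcal{F})
&\ge \sup_{\mathcal{F} : |\mathcal{F}| = N} \mathbb{E} \Bigl[ \sup_{f \in \mathcal{F}} \sum_{t=1}^n \Bigl( \frac{1}{2} - f_t \Bigr) \sigma_t \Bigr] \\
&\ge \frac{1}{2} \, \mathbb{E} \Bigl[ \max_{i=1,\dots,N} \sum_{t=1}^n Z_{i,t} \sigma_t \Bigr]
 = \frac{1}{2} \, \mathbb{E} \Bigl[ \max_{i=1,\dots,N} \sum_{t=1}^n Z_{i,t} \Bigr].
\end{align*}

Here the second inequality bounds the supremum over classes by the average over the random class
with $f_{i,t} = (1 - Z_{i,t})/2$, and the last equality holds because the products $Z_{i,t} \sigma_t$ are again
i.i.d. Rademacher random variables.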
By the central limit theorem, for each $i = 1, \dots, N$, $n^{-1/2} \sum_{t=1}^n Z_{i,t}$ converges to a standard
normal random variable. In fact, it is not difficult to show (see Lemma A.11 in the Appendix
for the details) that

\[
\lim_{n \to \infty} \mathbb{E} \Bigl[ \max_{i=1,\dots,N} \frac{1}{\sqrt{n}} \sum_{t=1}^n Z_{i,t} \Bigr]
= \mathbb{E} \Bigl[ \max_{i=1,\dots,N} G_i \Bigr],
\]

where $G_1, \dots, G_N$ are independent standard normal random variables. But it is well known
(see Lemma A.12 in the Appendix) that

\[
\lim_{N \to \infty} \frac{\mathbb{E} \bigl[ \max_{i=1,\dots,N} G_i \bigr]}{\sqrt{2 \ln N}} = 1,
\]

and this concludes the proof. ∎
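
The quantity $\frac{1}{2} \mathbb{E} \max_{i} \sum_{t} Z_{i,t}$ appearing in the proof can be estimated by simulation and compared with $\sqrt{(n/2) \ln N}$. The following Monte Carlo sketch is only illustrative: the function names, the seed, and the sample sizes are our own choices, and each Rademacher sum is drawn as $2B - n$ with $B$ the number of one bits among $n$ fair random bits.

```python
import math
import random


def rademacher_sum(n_rounds, rng):
    """Sum of n_rounds independent +/-1 signs, drawn as 2 * (number of 1 bits) - n_rounds."""
    ones = bin(rng.getrandbits(n_rounds)).count("1")
    return 2 * ones - n_rounds


def lower_bound_ratio(n_experts, n_rounds, trials, seed=0):
    """Estimate (1/2) E[max_i sum_t Z_{i,t}] / sqrt((n/2) ln N) by simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += max(rademacher_sum(n_rounds, rng) for _ in range(n_experts))
    half_expected_max = 0.5 * total / trials
    return half_expected_max / math.sqrt(0.5 * n_rounds * math.log(n_experts))


ratio = lower_bound_ratio(n_experts=64, n_rounds=400, trials=500)
```

For moderate $N$ the estimated ratio stays visibly below 1; the convergence of $\mathbb{E} \max_i G_i / \sqrt{2 \ln N}$ to 1 is slow, so very large $N$ would be needed to see values close to the limit.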


3.8 Bibliographic Remarks 


The follow-the-best-expert forecaster and its variants have been thoroughly studied in a 
somewhat more general framework, known as the sequential compound decision problem 
first put forward by Robbins [244]; see also Blackwell [28, 29], Gilliland [127], Gilliland 
and Hannan [128], Hannan [141], Hannan and Robbins [142], Merhav and Feder [213], 
van Ryzin [254], and Samuel [256,257]. In these papers general conditions may be found 
that guarantee that the per-round regret converges to 0. Lemma 3.1 is due to Hannan [141]. 
The example of the square loss with constant experts has been studied by Takimoto and
Warmuth [284], who show that the minimax regret is $\ln n - \ln \ln n + o(1)$.

Exp-concave loss functions were studied by Kivinen and Warmuth [182]. Theorem 3.3
is a generalization of an argument of Blum and Kalai [33]. The mixability curve of
Section 3.5 was introduced by Vovk [298,300], and also studied by Haussler, Kivinen, and
Warmuth [151], who characterize the loss functions for which $\Theta(\ln N)$ regret bounds are
possible. The characterization of the optimal $\eta$ for mixable functions given in Remark 3.1
is due to Haussler, Kivinen, and Warmuth [151]. The examples of mixability are taken
from Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth [48], Haussler,
Kivinen, and Warmuth [151], and Vovk [298, 300]. The mixability curve of Section 3.5
is equivalent to the notion of “extended stochastic complexity” introduced by
Yamanishi [313]. This notion generalizes Rissanen’s stochastic complexity [240,242].

Figure 3.6. The Hellinger loss.

In the case of binary outcomes $\mathcal{Y} = \{0, 1\}$, mixability of a loss function $\ell$ is equivalent
to the existence of the predictive complexity for $\ell$, as shown by Kalnishkan, Vovk and
Vyugin [178], who also provide an analytical characterization of mixability in the binary 
case. The predictive complexity for the logarithmic loss is equivalent to Levin’s version of 
Kolmogorov complexity, see Zvonkin and Levin [320]. In this sense, predictive complexity 
may be viewed as a generalization of Kolmogorov complexity. 

Theorem 3.6 is due to Haussler, Kivinen, and Warmuth [151]. In the same paper, they
also prove a lower bound that, asymptotically for both $N, n \to \infty$, matches the upper
bound $c_\ell \ln N$ for any mixable loss $\ell$. This result was initially proven by Vovk [298] with a
more involved analysis. Theorem 3.7 is due to Cesa-Bianchi, Freund, Haussler, Helmbold, 
Schapire, and Warmuth [48]. Similar results for other loss functions appear in Haussler, 
Kivinen, and Warmuth [151]. More general lower bounds for the absolute loss were proved 
by Cesa-Bianchi and Lugosi [51]. 


3.9 Exercises 


3.1 Consider the follow-the-best-expert forecaster studied in Section 3.2. Assume that $\mathcal{D} = \mathcal{Y}$ is a
convex subset of a topological vector space. Assume that the sequence of outcomes is such that
$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n y_t = y$ for some $y \in \mathcal{Y}$. Establish weak general conditions that guarantee that
the per-round regret of the forecaster converges to 0.


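3.2 Let $\mathcal{Y} = \mathcal{D} = [0, 1]$, and consider the Hellinger loss

\[
\ell(p, y) = \frac{1}{2} \Bigl( \bigl(\sqrt{p} - \sqrt{y}\bigr)^2 + \bigl(\sqrt{1 - p} - \sqrt{1 - y}\bigr)^2 \Bigr)
\]

(see Figure 3.6). Determine the values of $\eta$ for which the function $F(z)$ defined in Section 3.3
is concave.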


3.3 Let $\mathcal{Y} = \mathcal{D}$ be the closed ball of radius $r > 0$, centered at the origin in $\mathbb{R}^d$, and consider the
loss $\ell(z, y) = \|z - y\|^2$ (where $\|\cdot\|$ denotes the euclidean norm in $\mathbb{R}^d$). Show that $F(z)$ of
Theorem 3.2 is concave whenever $\eta \le 1/(8r^2)$.

3.4 Show that if for a $y \in \mathcal{Y}$ the function $F(z) = e^{-\eta \ell(z, y)}$ is concave, then $\ell(z, y)$ is a convex
function of $z$.

3.5 Show that if a loss function is exp-concave for a certain value of $\eta > 0$, then it is also exp-concave
for any $\eta' \in (0, \eta)$.

3.6 In classification problems, various versions of the so-called hinge loss have been used. Typically,
$\mathcal{Y} = \{-1, 1\}$, $\mathcal{D} = [-1, 1]$, and the loss function has the form $\ell(p, y) = c(-yp)$, where $c$ is a
nonnegative, increasing, and convex cost function. Derive conditions for $c$ and the parameter $\eta$
under which the hinge loss is exp-concave.

3.7 (Discounted regret for exp-concave losses) Consider the discounted regret

\[
\rho_{i,n} = \sum_{t=1}^n \beta_{n-t} \bigl( \ell(\widehat{p}_t, y_t) - \ell(f_{i,t}, y_t) \bigr)
\]

defined in Section 2.11, where $1 = \beta_0 \ge \beta_1 \ge \cdots$ is a nonincreasing sequence of discount
factors. Assume that the loss function $\ell$ is exp-concave for some $\eta > 0$, and consider the
discounted exponentially weighted average forecaster

\[
\widehat{p}_t = \frac{\sum_{i=1}^N f_{i,t} \exp\bigl( -\eta \sum_{s=1}^{t-1} \beta_{n-s} \ell(f_{i,s}, y_s) \bigr)}
                    {\sum_{j=1}^N \exp\bigl( -\eta \sum_{s=1}^{t-1} \beta_{n-s} \ell(f_{j,s}, y_s) \bigr)} .
\]

Show that the average discounted regret is bounded by

\[
\max_{i=1,\dots,N} \frac{\rho_{i,n}}{\sum_{t=1}^n \beta_{n-t}} \le \frac{\ln N}{\eta \sum_{t=1}^n \beta_{n-t}} .
\]

In particular, show that the average discounted regret is $o(1)$ if and only if $\sum_{t=0}^\infty \beta_t = \infty$.

3.8 Show that the greedy forecaster based on “fictitious play” defined at the beginning of Section 3.4
does not guarantee that $n^{-1} \max_{i \le N} R_{i,n}$ converges to 0 as $n \to \infty$ for all outcome sequences.
Hint: It suffices to consider the simple example when $\mathcal{Y} = \mathcal{D} = [0, 1]$, $\ell(p, y) = |p - y|$, and
$N = 2$.

3.9 Show that for $\mathcal{Y} = \{0, 1\}$, $\mathcal{D} = [0, 1]$ and the logarithmic loss function, the greedy forecaster
based on the exponential potential with $\eta = 1$ is just the exponentially weighted average
forecaster (also with $\eta = 1$).

3.10 Show that $\mu(\eta) \ge 1$ for all loss functions $\ell$ on $\mathcal{D} \times \mathcal{Y}$ satisfying the following conditions: (1)
there exists $p \in \mathcal{D}$ such that $\ell(p, y) < \infty$ for all $y \in \mathcal{Y}$; (2) there exists no $p \in \mathcal{D}$ such that
$\ell(p, y) = 0$ for all $y \in \mathcal{Y}$ (Vovk [298]).

3.11 Show the following: Let $C \subset \mathbb{R}^2$ be a curve with parametric equations $x = x(t)$ and $y = y(t)$,
which are twice differentiable functions. If there exists a twice differentiable function $h$ such
that $y(t) = h(x(t))$ for $t$ in some open interval, then

\[
\frac{dy}{dx} = \frac{dy/dt}{dx/dt}
\qquad \text{and} \qquad
\frac{d^2 y}{dx^2} = \frac{\frac{d}{dt} \bigl( \frac{dy}{dx} \bigr)}{dx/dt} .
\]

3.12 Check that for the square loss $\ell(p, y) = (p - y)^2$, $p, y \in [0, 1]$, function (3.4) is concave in $y$
if $\eta \le 1/2$ (Vovk [300]).

3.13 Prove that for the square loss there is no function $g$ such that the prediction $\widehat{p} = g\bigl(\sum_{i=1}^N f_i q_i\bigr)$
satisfies (3.3) with $c = 1$. Hint: Consider $N = 2$ and find $f_1, f_2, q_1, q_2$ and $f_1', f_2', q_1', q_2'$
such that $f_1 q_1 + f_2 q_2 = f_1' q_1' + f_2' q_2'$ but plugging these values in (3.3) yields a contradiction
(Haussler, Kivinen, and Warmuth [151]).
3.14 Prove that for the absolute loss there is no function $g$ for which the prediction $\widehat{p} = g\bigl(\sum_{i=1}^N f_i q_i\bigr)$
satisfies (3.5), where $\mu(\eta)$ is the mixability curve for the absolute loss. Hint: Set $\eta = 1$ and
follow the hint in Exercise 3.13 (Haussler, Kivinen, and Warmuth [151]).

3.15 Prove that the mixability curve for the absolute loss with binary outcomes is also the mixability
function for the case when the outcome space is $[0, 1]$ (Haussler, Kivinen, and Warmuth [151]).
Warning: This exercise is not easy.

3.16 Find the mixability curve for the following loss: $\mathcal{D}$ is the probability simplex in $\mathbb{R}^N$, $\mathcal{Y} = [0, 1]^N$,
and $\ell(\widehat{p}, y) = \widehat{p} \cdot y$ (Vovk [298]). Warning: This exercise is not easy.
Randomized Prediction


4.1 Introduction 


The results of Chapter 2 crucially build on the assumption that the loss function $\ell$ is convex
in its first argument. While this assumption is natural in many applications, it is not satisfied 
in some important examples. 

One such example is the case when both the decision space and the outcome space
are $\{0, 1\}$ and the loss function is $\ell(\widehat{p}, y) = \mathbb{I}_{\{\widehat{p} \ne y\}}$. In this case it is clear that for any
deterministic strategy of the forecaster, there exists an outcome sequence $y^n = (y_1, \dots, y_n)$
such that the forecaster errs at every single time instant, that is, $\widehat{L}_n = n$. Thus, except for
trivial cases, it is impossible to achieve $\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n} = o(n)$ uniformly over all
outcome sequences. For example, if $N = 2$ and the two experts’ predictions are $f_{1,t} = 0$
and $f_{2,t} = 1$ for all $t$, then $\min_{i=1,2} L_{i,n} \le n/2$, and no matter what the forecaster does,

\[
\max_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}_n(y^n) - \min_{i=1,\dots,N} L_{i,n}(y^n) \Bigr) \ge \frac{n}{2} .
\]
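
The argument above is constructive: since the forecaster is deterministic, the environment can compute its next prediction and choose the opposite outcome. The sketch below illustrates this with two hypothetical deterministic forecasters of our own (a constant one and a follow-the-majority rule); the function names are assumptions, and the two experts are the constant predictions 0 and 1.

```python
def adversarial_regret(forecaster, n):
    """Play n rounds against a deterministic forecaster, always choosing the
    outcome y_t = 1 - p_t. `forecaster` maps the list of past outcomes to a
    prediction in {0, 1}. Returns the regret against the best of the two
    constant experts (always 0, always 1) under the 0-1 loss."""
    history = []
    forecaster_loss = 0
    for _ in range(n):
        p = forecaster(history)          # known to the adversary in advance
        y = 1 - p                        # outcome that makes the forecaster err
        forecaster_loss += int(p != y)
        history.append(y)
    ones = sum(history)
    best_expert_loss = min(ones, n - ones)   # losses of "always 0" and "always 1"
    return forecaster_loss - best_expert_loss


def always_zero(history):
    return 0


def follow_majority(history):
    return 1 if 2 * sum(history) > len(history) else 0
```

In both cases the forecaster errs at every round while the better constant expert errs at most $n/2$ times, so the regret is at least $n/2$.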


One of the key ideas of the theory of prediction of individual sequences is that in such a 
situation randomization may help. Next we describe a version of the sequential prediction 
problem in which the forecaster has access, at each time instant, to an independent random 
variable uniformly distributed on the interval [0, 1]. The scenario is conveniently described 
in the framework of playing repeated games. 

Consider a game between a player (the forecaster) and the environment. At each round
of the game, the player chooses an action $i \in \{1, \dots, N\}$, and the environment chooses an
action $y \in \mathcal{Y}$ (in analogy with Chapters 2 and 3 we also call “outcome” the adversary’s
action). The player’s loss $\ell(i, y)$ at time $t$ is the value of a loss function
$\ell : \{1, \dots, N\} \times \mathcal{Y} \to [0, 1]$. Now suppose that, at the $t$th round of the game, the player chooses a probability
distribution $\mathbf{p}_t = (p_{1,t}, \dots, p_{N,t})$ over the set of actions and plays action $i$ with probability
$p_{i,t}$. Formally, the player’s action $I_t$ at time $t$ is defined by

\[
I_t = i \quad \text{if and only if} \quad U_t \in \Bigl[ \sum_{j=1}^{i-1} p_{j,t}, \; \sum_{j=1}^{i} p_{j,t} \Bigr)
\]

so that

\[
\mathbb{P}\bigl[ I_t = i \mid U_1, \dots, U_{t-1} \bigr] = p_{i,t}, \qquad i = 1, \dots, N,
\]

where $U_1, U_2, \dots$ are independent random variables, uniformly distributed in the interval
$[0, 1]$. Adopting a game-theoretic terminology, we often call an action a pure strategy and
a probability distribution over actions a mixed strategy. Suppose the environment chooses
action $y_t \in \mathcal{Y}$. The loss of the player is then $\ell(I_t, y_t)$.

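If $I_1, I_2, \dots$ is the sequence of forecaster's actions and $y_1, y_2, \dots$ is the sequence of
outcomes chosen by the environment, the player's goal is to minimize the cumulative regret

\[ \hat{L}_n - \min_{i=1,\dots,N} L_{i,n} = \sum_{t=1}^n \ell(I_t, y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, y_t), \]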


that is, the realized difference between the cumulative loss and the loss of the best pure 
strategy. 
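As an illustration of the two definitions above, here is a minimal sketch (the function names `draw_action` and `realized_regret` are hypothetical, and actions are indexed from 0 rather than 1). It maps a uniform variable $U_t$ to an action through the cumulative sums of $p_t$ and computes the realized regret against the best pure strategy.

```python
import bisect
import itertools


def draw_action(p, u):
    """Return the 0-based index i with u in [p_0+...+p_{i-1}, p_0+...+p_i)."""
    cumulative = list(itertools.accumulate(p))
    i = bisect.bisect_right(cumulative, u)
    return min(i, len(p) - 1)   # guard against round-off when u is close to 1


def realized_regret(losses, actions):
    """losses[t][i] is the loss of action i at round t; actions[t] is the action played."""
    n_actions = len(losses[0])
    forecaster = sum(row[a] for row, a in zip(losses, actions))
    best = min(sum(row[i] for row in losses) for i in range(n_actions))
    return forecaster - best
```

In a simulation one would draw `u` from `random.random()` at each round; the half-open intervals guarantee that every `u` in $[0, 1)$ selects exactly one action.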


Remark 4.1 (Experts vs. actions). Note that, unlike for the game of prediction with expert 
advice analyzed in Chapter 2, here we define the regret in terms of the forecaster’s best 
constant prediction rather than in terms of the best expert. Hence, as experts can be arbitrarily 
complex prediction strategies, this game appears to be easier for the forecaster than the 
game of Chapter 2. However, as our forecasting algorithms make assumptions neither on
the structure of the outcome space $\mathcal{Y}$ nor on the structure of the loss function $\ell$, we can
prove for the expert model the same bounds proven in this chapter. This is done via a
simple reduction in which we use $\{1, \dots, N\}$ to index the set of experts and define the loss
$\ell'(i, y_t) = \ell(f_{i,t}, y_t)$, where $\ell$ is the loss function used to score predictions in the expert
game.


Before moving on, we point out an important subtlety brought in by allowing the forecaster 
to randomize. The actions $y_t$ of the environment are now random variables $Y_t$ as they may
depend on the past (randomized) plays $I_1, \dots, I_{t-1}$ of the forecaster. One may even allow
the environment to introduce an independent randomization to determine the outcomes $Y_t$,
but this is irrelevant for the material of this section; so, for simplicity, we exclude this 
possibility. We explicitly deal with such situations in Chapter 7. 

A slightly less powerful model is obtained if one does not allow the actions of the 
environment to depend on the predictions $I_t$ of the forecaster. In this model, which we
call the oblivious opponent, the whole sequence $y_1, y_2, \dots$ of outcomes is determined
before the game starts. Then, at time $t$, the forecaster makes its (randomized) prediction $I_t$
and the environment reveals the $t$th outcome $y_t$. Thus, in this model, the $y_t$'s are nonrandom
fixed values. The model of an oblivious opponent is realistic whenever it is reasonable to 
believe that the actions of the forecaster do not have an effect on future outcomes of the 
sequence to be predicted. This is the case in many applications, such as weather forecasting 
or predicting a sequence of bits of a speech signal for encoding purposes. However, there 
are important cases when one cannot reasonably assume that the opponent is oblivious. 
The main example is when a player of a game predicts the other players’ next move and 
bases his action on such a prediction. In such cases the other players’ future actions may 
depend on the action (and therefore on the forecast) of the player in any complicated way. 
The stock market comes to mind as an obvious example. Such game-theoretic applications 
are discussed in Chapter 7. 

Formally, an oblivious opponent is defined by a fixed sequence $y_1, y_2, \dots$ of outcomes,
whereas a nonoblivious opponent is defined by a sequence of functions $g_1, g_2, \dots$, with
$g_t : \{1, \dots, N\}^{t-1} \to \mathcal{Y}$, and each outcome $Y_t$ is given by $Y_t = g_t(I_1, \dots, I_{t-1})$. Thus $Y_t$ is
measurable with respect to the $\sigma$-algebra generated by the random variables $U_1, \dots, U_{t-1}$.

Interestingly, any forecaster that is guaranteed to work well against an oblivious opponent 
also works well in the general model against any strategy of a nonoblivious opponent, in a 
certain sense. To formulate this fact, we introduce the expected loss 


\[ \bar{\ell}(p_t, Y_t) \stackrel{\text{def}}{=} \mathbb{E}\bigl[\ell(I_t, Y_t) \mid U_1, \dots, U_{t-1}\bigr] = \mathbb{E}_t\, \ell(I_t, Y_t) = \sum_{i=1}^N \ell(i, Y_t)\, p_{i,t}. \]

Here the "expected value" $\mathbb{E}_t$ is taken only with respect to the random variable $I_t$. More
precisely, $\bar{\ell}(p_t, Y_t)$ is the conditional expectation of $\ell(I_t, Y_t)$ given the past plays $I_1, \dots, I_{t-1}$
(we use $\mathbb{E}_t$ to denote this conditional expectation). Since $Y_t = g_t(I_1, \dots, I_{t-1})$, the value
of $\bar{\ell}(p_t, Y_t)$ is determined solely by $U_1, \dots, U_{t-1}$.

An important property of the forecasters investigated in this chapter is that the probability
distribution $p_t$ of play is fully determined by the past outcomes $Y_1, \dots, Y_{t-1}$; that is, it does
not explicitly depend on the past plays $I_1, \dots, I_{t-1}$. Formally, $p_t : \mathcal{Y}^{t-1} \to \mathcal{D}$ is a function
taking values in the simplex $\mathcal{D}$ of probability distributions over $N$ actions.

Note that the next result, which assumes that the forecaster is of the above form, is in
terms of the cumulative expected loss $\sum_{t=1}^n \bar{\ell}(p_t, Y_t)$, and not in terms of its expected value

\[ \mathbb{E}\Bigl[\sum_{t=1}^n \bar{\ell}(p_t, Y_t)\Bigr] = \mathbb{E}\Bigl[\sum_{t=1}^n \ell(I_t, Y_t)\Bigr]. \]


Exercise 4.1 describes a closely related result about the expected value of the cumulative 
losses. 


Lemma 4.1. Let $B, C$ be positive constants. Consider a randomized forecaster such that,
for every $t = 1, \dots, n$, $p_t$ is fully determined by the past outcomes $y_1, \dots, y_{t-1}$. Assume
the forecaster's expected regret against an oblivious opponent satisfies

\[ \sup_{y^n \in \mathcal{Y}^n} \mathbb{E}\Bigl[\sum_{t=1}^n \ell(I_t, y_t) - C \sum_{t=1}^n \ell(i, y_t)\Bigr] \le B, \qquad i = 1, \dots, N. \]

If the same forecaster is used against a nonoblivious opponent, then

\[ \sum_{t=1}^n \bar{\ell}(p_t, Y_t) - C \sum_{t=1}^n \ell(i, Y_t) \le B, \qquad i = 1, \dots, N \]

holds. Moreover, for all $\delta > 0$, with probability at least $1 - \delta$ the actual cumulative loss
satisfies, for any (nonoblivious) strategy of the opponent,

\[ \sum_{t=1}^n \ell(I_t, Y_t) - C \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \le B + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]

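Proof. Observe first that if the opponent is oblivious, that is, the sequence $y_1, \dots, y_n$ is
fixed, then $p_t$ is also fixed and thus for each $i = 1, \dots, N$,

\[ \mathbb{E}\Bigl[\sum_{t=1}^n \ell(I_t, y_t) - C \sum_{t=1}^n \ell(i, y_t)\Bigr] = \sum_{t=1}^n \bar{\ell}(p_t, y_t) - C \sum_{t=1}^n \ell(i, y_t), \]

which is bounded by $B$ by assumption. If the opponent is nonoblivious, then for each
$i = 1, \dots, N$,

\[
\begin{aligned}
\sum_{t=1}^n \bar{\ell}(p_t, Y_t) - C \sum_{t=1}^n \ell(i, Y_t)
&\le \sup_{y_1 \in \mathcal{Y}} \mathbb{E}_1\bigl[\ell(I_1, y_1) - C\ell(i, y_1)\bigr] + \cdots + \sup_{y_n \in \mathcal{Y}} \mathbb{E}_n\bigl[\ell(I_n, y_n) - C\ell(i, y_n)\bigr] \\
&= \sup_{y^n \in \mathcal{Y}^n} \Bigl( \mathbb{E}_1\bigl[\ell(I_1, y_1) - C\ell(i, y_1)\bigr] + \cdots + \mathbb{E}_n\bigl[\ell(I_n, y_n) - C\ell(i, y_n)\bigr] \Bigr).
\end{aligned}
\]

To see why the last equality holds, consider $n = 2$. Since the quantity $\mathbb{E}_1[\ell(I_1, y_1) - C\ell(i, y_1)]$
does not depend on the value of $y_2$, and since $y_2$, owing to our assumptions on
$p_2$, does not depend on the realization of $I_1$, we have

\[
\begin{aligned}
&\sup_{y_1 \in \mathcal{Y}} \mathbb{E}_1\bigl[\ell(I_1, y_1) - C\ell(i, y_1)\bigr] + \sup_{y_2 \in \mathcal{Y}} \mathbb{E}_2\bigl[\ell(I_2, y_2) - C\ell(i, y_2)\bigr] \\
&\qquad = \sup_{y_1 \in \mathcal{Y}} \sup_{y_2 \in \mathcal{Y}} \Bigl( \mathbb{E}_1\bigl[\ell(I_1, y_1) - C\ell(i, y_1)\bigr] + \mathbb{E}_2\bigl[\ell(I_2, y_2) - C\ell(i, y_2)\bigr] \Bigr).
\end{aligned}
\]

The proof for $n > 2$ follows by repeating the same argument. By assumption,

\[ \sup_{y^n \in \mathcal{Y}^n} \Bigl( \mathbb{E}_1\bigl[\ell(I_1, y_1) - C\ell(i, y_1)\bigr] + \cdots + \mathbb{E}_n\bigl[\ell(I_n, y_n) - C\ell(i, y_n)\bigr] \Bigr) \le B, \]

which concludes the proof of the first statement of the lemma.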

To prove the second statement just observe that the random variables $\ell(I_t, Y_t) - \bar{\ell}(p_t, Y_t)$,
$t = 1, 2, \dots$ form a sequence of bounded martingale differences (with respect
to the sequence $U_1, U_2, \dots$ of randomizing variables). Thus, a simple application of the
Hoeffding-Azuma inequality (see Lemma A.7 in the Appendix) yields that for every
$\delta > 0$, with probability at least $1 - \delta$,

\[ \sum_{t=1}^n \ell(I_t, Y_t) \le \sum_{t=1}^n \bar{\ell}(p_t, Y_t) + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]

Combine this inequality with the first statement to complete the proof. □


Most of the results presented in this and the next two chapters (with the exception of those 
of Section 6.10) are valid in the general model of the nonoblivious opponent. So, unless 
otherwise specified, we allow the actions of the environment to depend on the randomized
predictions of the forecaster. 

In Section 4.2 we show that the techniques of Chapter 2 may be used in the setup of 
randomized prediction as well. In particular, a simple adaptation of the weighted average 
forecaster guarantees that the actual regret becomes negligible as n grows. This property is 
known as Hannan consistency. 

In Section 4.3 we describe and analyze a randomized forecaster suggested by Hannan. 
This forecaster adds a small random “noise” to the observed cumulative losses of all 
strategies and selects the one achieving the minimal value. We show that this simple 
method achieves a regret comparable to that of weighted average predictors. In particular, 
the forecaster is Hannan consistent. 

In Section 4.4, a refined notion of regret, the so-called internal regret, is introduced. It 
is shown that even though control of internal regret is more difficult than achieving Hannan 
consistency, the results of Chapter 2 may be used to construct prediction strategies with
small internal regret. 

An interesting application of no-internal-regret prediction is described in Section 4.5. 
The forecasters considered there predict a $\{0, 1\}$-valued sequence with numbers in
$[0, 1]$. Each number is interpreted as the predicted "chance" of the next outcome being 1.
A forecaster is well calibrated if at those time instants in which a certain percentage is
predicted, the fraction of 1’s indeed turns out to be the predicted value. The main result 
of the section is the description of a randomized forecaster that is well calibrated for all 
possible sequences of outcomes. 

In Section 4.6 we introduce a general notion of regret that contains both internal and 
external regret as special cases, as well as a number of other applications. 

The chapter is concluded by describing, in Section 4.7, a significantly stronger notion of 
calibration, by introducing checking rules. The existence of a calibrated forecaster, in this 
stronger sense, follows by a simple application of the general setup of Section 4.6. 


4.2 Weighted Average Forecasters 


In order to understand the randomized prediction problem described above, we first consider
a simplified version in which we only consider the expected loss $\mathbb{E}_t\, \ell(I_t, Y_t) = \bar{\ell}(p_t, Y_t)$. If
the player's goal is to minimize the difference between the cumulative expected loss

\[ \bar{L}_n \stackrel{\text{def}}{=} \sum_{t=1}^n \bar{\ell}(p_t, Y_t) \]

and the loss of the best pure strategy, then the problem becomes a special case of the
problem of prediction using expert advice described in Chapter 2. To see this, define the
decision space $\mathcal{D}$ of the forecaster (player) as the simplex of probability distributions in $\mathbb{R}^N$.
Then the instantaneous regret for the player is the vector $r_t \in \mathbb{R}^N$, whose $i$th component is

\[ r_{i,t} = \bar{\ell}(p_t, Y_t) - \ell(i, Y_t). \]

This quantity measures the expected change in the player's loss if it were to deterministically
choose action $i$ and the environment did not change its action. Observe that the "expected"
loss function is linear (and therefore convex) in its first argument, and Lemma 2.1 may be
applied in this situation. To this end, we recall the weighted average forecaster of Section 2.1
based on the potential function

\[ \Phi(u) = \psi\Bigl( \sum_{i=1}^N \phi(u_i) \Bigr), \]

where $\phi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, increasing, and twice differentiable function and
$\psi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, strictly increasing, concave, and twice differentiable auxiliary
function (which only plays a role in the analysis but not in the definition of the
forecasting strategy). The general potential-based weighted average strategy $p_t$ is now

\[ p_{i,t} = \frac{\nabla \Phi(R_{t-1})_i}{\sum_{k=1}^N \nabla \Phi(R_{t-1})_k} = \frac{\phi'(R_{i,t-1})}{\sum_{k=1}^N \phi'(R_{k,t-1})} \]

for $t > 1$ and $p_{i,1} = 1/N$ for $i = 1, \dots, N$.
By Lemma 2.1, for any $y_t \in \mathcal{Y}$, the Blackwell condition holds, that is,

\[ r_t \cdot \nabla \Phi(R_{t-1}) \le 0. \qquad (4.1) \]

In fact, by the linearity of the loss function, it is immediate to see that inequality (4.1) is
satisfied with equality. Thus, because the loss function ranges in $[0, 1]$, choosing $\phi(x) = x_+^p$
(with $p \ge 2$) or $\phi(x) = e^{\eta x}$, the respective performance bounds of Corollaries 2.1 and 2.2
hold in this new setup as well. Also, Theorem 2.2 may be applied, which we recall here for
completeness: consider the exponentially weighted average player given by

\[ p_{i,t} = \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(i, Y_s)\bigr)}{\sum_{k=1}^N \exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(k, Y_s)\bigr)}, \qquad i = 1, \dots, N, \]

where $\eta > 0$. Then

\[ \bar{L}_n - \min_{i=1,\dots,N} L_{i,n} \le \frac{\ln N}{\eta} + \frac{n\eta}{8}. \]


With the choice $\eta = \sqrt{8 \ln N / n}$ the upper bound becomes $\sqrt{(n \ln N)/2}$. We remark here
that this bound is essentially unimprovable in this generality. Indeed, in Section 3.7 we
prove the following.


Corollary 4.1. Let $n, N \ge 1$. There exists a loss function $\ell$ such that for any randomized
forecasting strategy,

\[ \sup_{y^n \in \mathcal{Y}^n} \Bigl( \bar{L}_n - \min_{i=1,\dots,N} L_{i,n} \Bigr) \ge (1 - \varepsilon_{N,n}) \sqrt{\frac{n \ln N}{2}}, \]

where $\lim_{N \to \infty} \lim_{n \to \infty} \varepsilon_{N,n} = 0$.


Thus the expected cumulative loss of randomized forecasters may be effectively bounded by 
the results of Chapter 2. On the other hand, it is more interesting to study the behavior of the 
actual (random) cumulative loss $\ell(I_1, Y_1) + \cdots + \ell(I_n, Y_n)$. As we have already observed
in the proof of Lemma 4.1, the random variables $\ell(I_t, Y_t) - \bar{\ell}(p_t, Y_t)$, for $t = 1, 2, \dots$, form
a sequence of bounded martingale differences, and a simple application of the Hoeffding-Azuma
inequality yields the following.


Corollary 4.2. Let $n, N \ge 1$ and $\delta \in (0, 1)$. The exponentially weighted average forecaster
with $\eta = \sqrt{8 \ln N / n}$ satisfies, with probability at least $1 - \delta$,

\[ \sum_{t=1}^n \ell(I_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \le \sqrt{\frac{n \ln N}{2}} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]
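The exponentially weighted average forecaster and the two bounds above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names and the deterministic toy loss pattern are hypothetical, and actions are indexed from 0.

```python
import math


def ewa_distributions(losses, eta):
    """Exponentially weighted average forecaster.

    losses[t][i] in [0, 1] is the loss of action i at round t.
    Returns the mixed strategies p_1, ..., p_n, each computed from past losses only.
    """
    n_actions = len(losses[0])
    cumulative = [0.0] * n_actions
    distributions = []
    for row in losses:
        shift = min(cumulative)   # subtracting a constant leaves p_t unchanged
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        total = sum(weights)
        distributions.append([w / total for w in weights])
        cumulative = [c + l for c, l in zip(cumulative, row)]
    return distributions


def expected_regret(losses, distributions):
    """Cumulative expected loss minus the loss of the best pure strategy."""
    n_actions = len(losses[0])
    expected = sum(sum(p_i * l_i for p_i, l_i in zip(p, row))
                   for p, row in zip(distributions, losses))
    best = min(sum(row[i] for row in losses) for i in range(n_actions))
    return expected - best


n, n_actions = 1000, 4
# Hypothetical oblivious loss sequence with values in [0, 1].
losses = [[((t * (i + 2)) % 7) / 6.0 for i in range(n_actions)] for t in range(n)]
eta = math.sqrt(8 * math.log(n_actions) / n)
distributions = ewa_distributions(losses, eta)
regret = expected_regret(losses, distributions)
expected_bound = math.sqrt(n * math.log(n_actions) / 2)
delta = 0.05
high_probability_bound = expected_bound + math.sqrt(n / 2 * math.log(1 / delta))
```

Here `expected_bound` is the bound on the expected regret and `high_probability_bound` adds the Hoeffding-Azuma term of Corollary 4.2 for the realized regret.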


The relationship between the forecaster's cumulative loss and the cumulative loss of the
best possible action appears in the classical notion of Hannan consistency, first established
by Hannan [141]. A forecaster is said to be Hannan consistent if

\[ \limsup_{n \to \infty} \frac{1}{n} \Bigl( \sum_{t=1}^n \ell(I_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \Bigr) = 0, \qquad \text{with probability 1,} \]


where “probability” is understood with respect to the randomization of the forecaster. A 
simple modification of Corollary 4.2 leads to the following. 
Corollary 4.3. Consider the exponentially weighted average forecaster defined above. If
the parameter $\eta = \eta_t$ is chosen by $\eta_t = \sqrt{(8 \ln N)/t}$ (as in Section 2.8), then the forecaster
is Hannan consistent.


The simple proof is left as an exercise. In the exercises some other modifications are 
described as well. 
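To make the time-varying tuning concrete, the sketch below (hypothetical function names, a hypothetical deterministic toy loss pattern) runs exponential weights with $\eta_t = \sqrt{(8 \ln N)/t}$ and reports the expected regret per round for a few horizons. It only tracks the expected loss $\bar{\ell}(p_t, y_t)$; passing to the realized loss still requires the martingale argument used for Corollary 4.2.

```python
import math


def ewa_time_varying(losses):
    """Exponential weights with eta_t = sqrt(8 ln N / t); returns the expected regret."""
    n_actions = len(losses[0])
    cumulative = [0.0] * n_actions
    expected_loss = 0.0
    for t, row in enumerate(losses, start=1):
        eta_t = math.sqrt(8.0 * math.log(n_actions) / t)
        shift = min(cumulative)   # subtracting a constant leaves p_t unchanged
        weights = [math.exp(-eta_t * (c - shift)) for c in cumulative]
        total = sum(weights)
        expected_loss += sum(w / total * l for w, l in zip(weights, row))
        cumulative = [c + l for c, l in zip(cumulative, row)]
    return expected_loss - min(cumulative)


def toy_losses(n, n_actions):
    # Hypothetical deterministic loss pattern with values in [0, 1].
    return [[((t * (i + 2)) % 7) / 6.0 for i in range(n_actions)] for t in range(n)]


average_regret = {n: ewa_time_varying(toy_losses(n, 4)) / n for n in (100, 1000, 10000)}
```

No horizon is passed to `ewa_time_varying`, which is the point of the time-varying choice: the same forecaster can be run for any number of rounds.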


In the definition of Hannan consistency the number of pure strategies N is kept fixed while 
we let the number n of rounds of the game grow to infinity. However, in some cases it is 
more reasonable to set up the asymptotic problem by allowing the number of comparison 
strategies to grow with n. Of course, this may be done in many different ways. Next 
we briefly describe such a model. The nonasymptotic nature of the performance bounds 
described above may be directly used to address this general problem. Let
$\ell : \{1, \dots, N\} \times \mathcal{Y} \to [0, 1]$ be a loss function, and consider the forecasting game defined above. A dynamic
strategy $s$ is simply a sequence of indices $s_1, s_2, \dots$, where $s_t \in \{1, \dots, N\}$ marks the
guess a forecaster predicting according to this strategy makes at time $t$. Given a class $\mathcal{S}$
of dynamic strategies, the forecaster's goal is to achieve a cumulative loss not much larger
than the best of the strategies in $\mathcal{S}$ regardless of the outcome sequence, that is, to keep the
difference

\[ \sum_{t=1}^n \ell(I_t, Y_t) - \min_{s \in \mathcal{S}} \sum_{t=1}^n \ell(s_t, Y_t) \]



as small as possible. The weighted average strategy described earlier may be generalized 
to this situation in a straightforward way, and repeating the same proof we thus obtain the 
following. 


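Theorem 4.1. Let $\mathcal{S}$ be an arbitrary class of dynamic strategies and, for each $t \ge 1$, denote
by $N_t$ the number of different vectors $s = (s_1, \dots, s_t) \in \{1, \dots, N\}^t$, $s \in \mathcal{S}$. For any $\eta > 0$
there exists a randomized forecaster such that for all $\delta \in (0, 1)$, with probability at least
$1 - \delta$,

\[ \sum_{t=1}^n \ell(I_t, Y_t) - \min_{s \in \mathcal{S}} \sum_{t=1}^n \ell(s_t, Y_t) \le \frac{\ln N_n}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]

Moreover, there exists a strategy such that if $n^{-1} \ln N_n \to 0$ as $n \to \infty$, then

\[ \frac{1}{n} \sum_{t=1}^n \ell(I_t, Y_t) - \frac{1}{n} \min_{s \in \mathcal{S}} \sum_{t=1}^n \ell(s_t, Y_t) \to 0, \qquad \text{with probability 1.} \]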


The details of the proof are left as an easy exercise. As an example, consider the following
class of dynamic strategies. Let $k_1, k_2, \dots$ be a monotonically increasing sequence of positive
integers such that $k_n \le n$ for all $n \ge 1$. Let $\mathcal{S}$ contain all dynamic strategies such that
each strategy changes actions at most $k_n - 1$ times between time 1 and $n$, for all $n$. In other
words, each $s \in \mathcal{S}$ is such that the sequence $s_1, \dots, s_n$ consists of at most $k_n$ constant segments. It
is easy to see that, for each $n$, $N_n = \sum_{k=1}^{k_n} \binom{n-1}{k-1} N (N - 1)^{k-1}$. Indeed, for each $k$, there are
$\binom{n-1}{k-1}$ different ways of dividing the time segment $1, \dots, n$ into $k$ pieces, and for a division
with segment lengths $n_1, \dots, n_k$, there are $N(N - 1)^{k-1}$ different ways of assigning actions
to the segments such that no two consecutive segments have the same actions assigned.
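The counting formula for $N_n$ can be verified by brute force on small instances. The sketch below uses hypothetical function names; `k_max` plays the role of $k_n$ and `n_actions` the role of $N$.

```python
from itertools import product
from math import comb


def count_strategies(n, n_actions, k_max):
    """N_n = sum_{k=1}^{k_max} C(n-1, k-1) * N * (N-1)^(k-1)."""
    return sum(comb(n - 1, k - 1) * n_actions * (n_actions - 1) ** (k - 1)
               for k in range(1, k_max + 1))


def count_by_enumeration(n, n_actions, k_max):
    """Brute force: count length-n sequences with at most k_max - 1 action changes."""
    count = 0
    for s in product(range(n_actions), repeat=n):
        switches = sum(a != b for a, b in zip(s, s[1:]))
        if switches <= k_max - 1:
            count += 1
    return count
```

For instance, with `k_max = n` every sequence is allowed and the formula collapses to $N^n$ by the binomial theorem, which gives a quick sanity check.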
Thus, Theorem 4.1 implies that, whenever the sequence $\{k_n\}$ is such that $k_n \ln n = o(n)$, we
have

\[ \frac{1}{n} \sum_{t=1}^n \ell(I_t, Y_t) - \frac{1}{n} \min_{s \in \mathcal{S}} \sum_{t=1}^n \ell(s_t, Y_t) \to 0, \qquad \text{with probability 1.} \]


Of course, calculating the weighted average forecaster for a large class of dynamic strate- 
gies may be computationally prohibitive. However, in many important cases (such as the 
concrete example given above), efficient algorithms exist. Several examples are described 
in Chapter 5. 

Corollary 4.2 is based on the Hoeffding-Azuma inequality, which asserts that, with
probability at least $1 - \delta$,

\[ \sum_{t=1}^n \ell(I_t, Y_t) \le \sum_{t=1}^n \bar{\ell}(p_t, Y_t) + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]

This inequality is combined with the fact that the cumulative expected regret is bounded as
$\sum_{t=1}^n \bar{\ell}(p_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \le \sqrt{(n/2) \ln N}$. The term bounding the random
fluctuations is thus about the same order of magnitude as the bound for the expected regret.
However, it is pointed out in Section 2.4 that if the cumulative loss $\min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t)$
is guaranteed to be bounded by a quantity $L^* \ll n$, then significantly improved performance
bounds, of the form $\sqrt{2 L^* \ln N} + \ln N$, may be achieved. In this case the
term $\sqrt{(n/2) \ln(1/\delta)}$ resulting from the Hoeffding-Azuma inequality becomes dominating.
However, in such a situation improved bounds may be established for the difference
between the "actual" and "expected" cumulative losses. This may be done by invoking
Bernstein's inequality for martingales (see Lemma A.8 in the Appendix). Indeed,
because

\[
\begin{aligned}
\mathbb{E}_t\bigl[\ell(I_t, Y_t) - \bar{\ell}(p_t, Y_t)\bigr]^2 &= \mathbb{E}_t\, \ell(I_t, Y_t)^2 - \bigl(\bar{\ell}(p_t, Y_t)\bigr)^2 \\
&\le \mathbb{E}_t\, \ell(I_t, Y_t)^2 \le \mathbb{E}_t\, \ell(I_t, Y_t) = \bar{\ell}(p_t, Y_t),
\end{aligned}
\]

the sum of the conditional variances is bounded by the expected cumulative loss
$\bar{L}_n = \sum_{t=1}^n \bar{\ell}(p_t, Y_t)$. Then by Lemma A.8, with probability at least $1 - \delta$,

\[ \sum_{t=1}^n \ell(I_t, Y_t) \le \sum_{t=1}^n \bar{\ell}(p_t, Y_t) + \sqrt{2 \bar{L}_n \ln \frac{1}{\delta}} + \frac{2\sqrt{2}}{3} \ln \frac{1}{\delta}. \]

If the cumulative loss of the best action is bounded by a number $L^*$ known in advance,
then $\bar{L}_n$ can be bounded by $L^* + \sqrt{2 L^* \ln N} + \ln N$ (see Corollary 2.4), and the effect of
the random fluctuations is comparable with the bound for the expected regret.
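A quick numerical comparison shows how much the Bernstein-type term can gain when the expected cumulative loss is small relative to $n$. This is a sketch with hypothetical function names; the inputs `n`, `expected_loss` (standing for $\bar{L}_n$), and `delta` are illustrative values, not taken from the text.

```python
import math


def hoeffding_azuma_term(n, delta):
    """Deviation term sqrt((n / 2) * ln(1 / delta))."""
    return math.sqrt(n / 2 * math.log(1 / delta))


def bernstein_term(expected_loss, delta):
    """Deviation term sqrt(2 * L * ln(1 / delta)) + (2 * sqrt(2) / 3) * ln(1 / delta)."""
    log_term = math.log(1 / delta)
    return math.sqrt(2 * expected_loss * log_term) + 2 * math.sqrt(2) / 3 * log_term


n, expected_loss, delta = 10**6, 100.0, 0.01
ha = hoeffding_azuma_term(n, delta)
bern = bernstein_term(expected_loss, delta)
```

With these illustrative values the Hoeffding-Azuma term scales with $\sqrt{n}$ while the Bernstein term scales with $\sqrt{\bar{L}_n}$, so `bern` is far smaller than `ha`.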


4.3 Follow the Perturbed Leader 


One may wonder whether "following the leader," that is, predicting at time $t$ according
to the action $i$ whose cumulative loss $L_{i,t-1}$ up to that time is minimal, is a reasonable
algorithm. In Section 3.2 we saw that under certain conditions for the loss function, this
is quite a powerful strategy. However, it is easy to see that in the setup of this chapter
this strategy, also known as fictitious play, does not achieve Hannan consistency. To see
this, just consider $N = 2$ actions such that the sequence of losses $\ell(1, y_t)$ of the first action
is $(1/2, 0, 1, 0, 1, 0, 1, \dots)$ while the values of $\ell(2, y_t)$ are $(1/2, 1, 0, 1, 0, 1, 0, \dots)$. Then
$L_{i,n}$ is about $n/2$ for both $i = 1$ and $i = 2$, but fictitious play suffers a loss close to $n$
(for instance, when ties are broken in favor of the second action). (See
also Exercises 3.8 and 4.6.) However, as it was pointed out by Hannan [141], a simple
modification suffices to achieve a significantly improved performance. One merely has
to add a small random perturbation to the cumulative losses and follow the "perturbed"
leader.

Formally, let $Z_1, Z_2, \dots$ be independent, identically distributed random $N$-vectors with
components $Z_{i,t}$ ($i = 1, \dots, N$). For simplicity, assume that $Z_t$ has an absolutely continuous
distribution (with respect to the $N$-dimensional Lebesgue measure) and denote its density
by $f$. The follow-the-perturbed-leader forecaster selects, at time $t$, an action

\[ I_t = \operatorname*{argmin}_{i=1,\dots,N} \bigl( L_{i,t-1} + Z_{i,t} \bigr). \]

(Ties may be broken, say, in favor of the smallest index.)

Unlike in the case of weighted average predictors, the definition of the forecaster does
not explicitly specify the probabilities $p_{i,t}$ of selecting action $i$ at time $t$. However, it is clear
from the definition that given the past sequence of plays and outcomes, the value of $p_{i,t}$
only depends on the joint distribution of the random variables $Z_{i,t}$, but not on their random
values. Therefore, by Lemma 4.1 it suffices to consider the model of oblivious opponent
and derive bounds for the expected regret in that case.

To state the main result of this section, observe that the loss $\ell(I_t, y_t)$ suffered at time $t$
is a function of the random vector $Z_t$ and denote this function by $F_t(Z_t) = \ell(I_t, y_t)$. The
precise definition of $F_t : \mathbb{R}^N \to [0, 1]$ is given by

\[ F_t(z) = \ell\Bigl( \operatorname*{argmin}_{i=1,\dots,N} \bigl( L_{i,t-1} + z_i \bigr),\ y_t \Bigr), \qquad z = (z_1, \dots, z_N). \]

Denoting by $\boldsymbol{\ell}_t = (\ell(1, y_t), \dots, \ell(N, y_t))$ the vector of losses of the actions at time $t$, we
have the following general result.


Theorem 4.2. Consider the randomized prediction problem with an oblivious opponent.
The expected cumulative loss of the follow-the-perturbed-leader forecaster is bounded by

\[ \bar{L}_n - \min_{i=1,\dots,N} L_{i,n} \le \mathbb{E} \max_{i=1,\dots,N} Z_{i,1} + \mathbb{E} \max_{i=1,\dots,N} (-Z_{i,1}) + \sum_{t=1}^n \int F_t(z) \bigl( f(z) - f(z - \boldsymbol{\ell}_t) \bigr)\, dz \]

and also

\[ \bar{L}_n \le \sup_{z,t} \frac{f(z)}{f(z - \boldsymbol{\ell}_t)} \Bigl( \min_{i=1,\dots,N} L_{i,n} + \mathbb{E} \max_{i=1,\dots,N} Z_{i,1} + \mathbb{E} \max_{i=1,\dots,N} (-Z_{i,1}) \Bigr). \]

Even though in this chapter we only consider losses taking values in $[0, 1]$, it is worth
pointing out that Theorem 4.2 holds for all, not necessarily bounded, loss functions. Before
proving the general statement, we illustrate its meaning by the following immediate corollary.
(Boundedness of $\ell$ is necessary for the corollary.)


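Corollary 4.4. Assume that the random vector $Z_t$ has independent components $Z_{i,t}$ uniformly
distributed on $[0, \Delta]$, where $\Delta > 0$. Then the expected cumulative loss of the
follow-the-perturbed-leader forecaster with an oblivious opponent satisfies

\[ \bar{L}_n - \min_{i=1,\dots,N} L_{i,n} \le \Delta + \frac{Nn}{\Delta}. \]

In particular, if $\Delta = \sqrt{nN}$, then

\[ \bar{L}_n - \min_{i=1,\dots,N} L_{i,n} \le 2\sqrt{nN}. \]

Moreover, with any (nonoblivious) opponent, with probability at least $1 - \delta$, the actual
regret satisfies

\[ \sum_{t=1}^n \ell(I_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \le 2\sqrt{nN} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \]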


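Proof. On the one hand, we obviously have

\[ \mathbb{E} \max_{i=1,\dots,N} (-Z_{i,1}) \le 0 \qquad \text{and} \qquad \mathbb{E} \max_{i=1,\dots,N} Z_{i,1} \le \Delta. \]

On the other hand, because $f(z) = \Delta^{-N}\, \mathbb{I}_{\{z \in [0, \Delta]^N\}}$ and $F_t$ is between 0 and 1,

\[
\begin{aligned}
\int F_t(z) \bigl( f(z) - f(z - \boldsymbol{\ell}_t) \bigr)\, dz
&\le \int_{\{z \,:\, f(z) > f(z - \boldsymbol{\ell}_t)\}} f(z)\, dz \\
&= \frac{1}{\Delta^N}\, \mathrm{Volume}\bigl( \{ z \in [0, \Delta]^N : z_i < \ell(i, y_t) \text{ for some } i \} \bigr) \\
&\le \frac{1}{\Delta^N} \sum_{i=1}^N \ell(i, y_t)\, \Delta^{N-1} \le \frac{N}{\Delta}.
\end{aligned}
\]

Summing over $t = 1, \dots, n$ and applying the first inequality of Theorem 4.2 gives the bound
$\Delta + Nn/\Delta$, as desired. The last statement follows by applying Lemma 4.1. □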


Note that in order to achieve the bound $2\sqrt{nN}$, previous knowledge of $n$ is required.
This may be easily avoided by choosing a time-dependent value of $\Delta$. In fact, by choosing
$\Delta_t = \sqrt{tN}$, the bound $2\sqrt{2nN}$ is achieved (see Exercise 4.7). Thus the price of not knowing
the horizon $n$ in advance is at most a factor of $\sqrt{2}$ in the upper bound.
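The forecaster of Corollary 4.4 with the fixed-horizon choice $\Delta = \sqrt{nN}$ can be sketched in a few lines. The function name, the seeds, and the randomly generated (but oblivious, fixed in advance) loss sequence are assumptions made for the illustration only.

```python
import math
import random


def fpl_uniform(losses, width, rng):
    """Follow the perturbed leader with fresh Uniform[0, width] perturbations each round.

    Returns the realized regret against the best fixed action.
    """
    n_actions = len(losses[0])
    cumulative = [0.0] * n_actions
    total = 0.0
    for row in losses:
        perturbed = [c + rng.uniform(0.0, width) for c in cumulative]
        action = min(range(n_actions), key=lambda i: perturbed[i])
        total += row[action]
        cumulative = [c + l for c, l in zip(cumulative, row)]
    return total - min(cumulative)


n, n_actions = 500, 4
env_rng = random.Random(1)            # hypothetical oblivious environment
losses = [[env_rng.random() for _ in range(n_actions)] for _ in range(n)]
width = math.sqrt(n * n_actions)      # Delta = sqrt(nN)
regret = fpl_uniform(losses, width, random.Random(0))
delta = 1e-6
guarantee = 2 * math.sqrt(n * n_actions) + math.sqrt(n / 2 * math.log(1 / delta))
```

The value `guarantee` is the high-probability bound of the corollary for confidence level `delta`; a single run's `regret` exceeds it with probability at most `delta` over the forecaster's randomization.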

The dependence of the upper bound $\sqrt{nN}$ derived above on the number $N$ of actions is
significantly worse than that of the $O(\sqrt{n \ln N})$ bounds derived in the previous section for
certain weighted average forecasters. Bounds of the optimal order may be achieved by the
follow-the-perturbed-leader forecaster if the distribution of the perturbations $Z_t$ is chosen
more carefully. Such an example is described in the following corollary.




Corollary 4.5. Assume that the $Z_{i,t}$ are independent with two-sided exponential distribution
of parameter $\eta > 0$, so that the joint density of $Z_t$ is $f(z) = (\eta/2)^N e^{-\eta \|z\|_1}$, where
$\|z\|_1 = \sum_{i=1}^N |z_i|$. Then the expected cumulative loss of the follow-the-perturbed-leader forecaster
against an oblivious opponent satisfies

\[ \bar{L}_n \le e^{\eta} \Bigl( L_n^* + \frac{2(1 + \ln N)}{\eta} \Bigr), \]

where $L_n^* = \min_{i=1,\dots,N} L_{i,n}$.
In particular, if $\eta = \min\bigl\{ 1, \sqrt{2(1 + \ln N) / ((e - 1) L_n^*)} \bigr\}$, then

\[ \bar{L}_n - L_n^* \le 2\sqrt{2(e - 1) L_n^* (1 + \ln N)} + 2(e + 1)(1 + \ln N). \]
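The passage from the first bound of Corollary 4.5 to the second can be checked numerically. The sketch below uses hypothetical function names; `best_loss` stands for $L_n^*$, and the case $L_n^* = 0$ is handled by taking $\eta = 1$, which is the value the minimum selects in the limit.

```python
import math


def tuned_eta(best_loss, n_actions):
    """eta = min{1, sqrt(2 (1 + ln N) / ((e - 1) L*))}, with eta = 1 when L* = 0."""
    c = 1.0 + math.log(n_actions)
    if best_loss <= 0:
        return 1.0
    return min(1.0, math.sqrt(2.0 * c / ((math.e - 1.0) * best_loss)))


def first_bound(best_loss, n_actions, eta):
    """Right-hand side e^eta * (L* + 2 (1 + ln N) / eta) of the first inequality."""
    c = 1.0 + math.log(n_actions)
    return math.exp(eta) * (best_loss + 2.0 * c / eta)


def regret_bound(best_loss, n_actions):
    """Right-hand side 2 sqrt(2 (e - 1) L* (1 + ln N)) + 2 (e + 1) (1 + ln N)."""
    c = 1.0 + math.log(n_actions)
    return 2.0 * math.sqrt(2.0 * (math.e - 1.0) * best_loss * c) + 2.0 * (math.e + 1.0) * c
```

For any `best_loss` and `n_actions`, `first_bound(best_loss, n_actions, tuned_eta(best_loss, n_actions)) - best_loss` should not exceed `regret_bound(best_loss, n_actions)`, which is exactly the "in particular" statement.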


Remark 4.2 (Nonoblivious opponent). Just as in Corollary 4.4, an analogous result may 
be established for the actual regret under any nonoblivious opponent. Using Lemma 4.1 
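(and $L_n^* \le n$) it is immediate to see that with probability at least $1 - \delta$, the actual regret
satisfies

\[
\begin{aligned}
&\sum_{t=1}^n \ell(I_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) \\
&\qquad \le 2\sqrt{2(e - 1) n (1 + \ln N)} + 2(e + 1)(1 + \ln N) + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}.
\end{aligned}
\]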


Proof of Corollary 4.5. Applying the theorem directly does not quite give the desired result.
However, the following simple observation makes it possible to obtain the optimal dependence
in terms of $N$. Consider the performance of the follow-the-perturbed-leader forecaster
on a related prediction problem, which has the property that, in each step, at most one action
has a nonzero loss. Observe that any prediction problem may be converted to satisfy this
property by replacing each round of play with loss vector $\boldsymbol{\ell}_t = (\ell(1, y_t), \dots, \ell(N, y_t))$ by
$N$ rounds so that in the $i$th round only the $i$th action has a positive loss, namely $\ell(i, y_t)$. The
new problem has $nN$ rounds of play, the total cumulative loss of the best action does not
change, and the follow-the-perturbed-leader has an expected cumulative loss at least as large
as in the original problem. To see this, consider time $t$ of the original problem in which the
expected loss is $\bar{\ell}(p_t, y_t) = \sum_{i=1}^N \ell(i, y_t)\, p_{i,t}$. This corresponds to steps $N(t - 1) + 1, \dots, Nt$
of the converted problem. At the beginning of this period of length $N$, the cumulative losses
of all actions coincide with the cumulative losses of the actions at time $t - 1$ in the original
problem. At time $N(t - 1) + 1$ the expected loss is $\ell(1, y_t)\, p_{1,t}$ (where $p_{i,t}$ denotes
the probability of $i$ assigned by the follow-the-perturbed-leader forecaster at time $t$ in the
original problem). At time $N(t - 1) + 2$ in the converted problem the cumulative losses of
all actions stay the same as in the previous time instant, except for the first action whose
loss has increased. Thus the probabilities assigned by the follow-the-perturbed-leader forecaster
increase except for that of the first action. This implies that at time $N(t - 1) + 2$
the expected loss is at least $\ell(2, y_t)\, p_{2,t}$. By repeating the argument $N$ times we see that
the expected loss accumulated during these $N$ periods of the converted problem is at least
$\ell(1, y_t)\, p_{1,t} + \cdots + \ell(N, y_t)\, p_{N,t} = \bar{\ell}(p_t, y_t)$. Therefore, it suffices to prove the corollary
for prediction problems satisfying $\|\boldsymbol{\ell}_t\|_1 \le 1$ for all $t$.
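The conversion used in this argument is purely mechanical, as the following sketch shows (the function name is hypothetical; losses are stored as a list of per-round rows, with actions indexed from 0).

```python
def one_loss_per_round(losses):
    """Replace each round with loss vector (l_1, ..., l_N) by N rounds,
    the i-th of which charges l_i to action i and 0 to every other action."""
    n_actions = len(losses[0])
    converted = []
    for row in losses:
        for i in range(n_actions):
            new_row = [0.0] * n_actions
            new_row[i] = row[i]
            converted.append(new_row)
    return converted


original = [[0.2, 0.5], [1.0, 0.0]]
converted = one_loss_per_round(original)
```

Each converted row has at most one nonzero entry, so its $\ell_1$ norm is at most 1 whenever the losses lie in $[0, 1]$, while the cumulative loss of every action is unchanged.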
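We apply the second inequality of Theorem 4.2. First note that

\[
\begin{aligned}
\mathbb{E} \max_{i=1,\dots,N} Z_{i,1} + \mathbb{E} \max_{i=1,\dots,N} (-Z_{i,1})
&\le 2 \int_0^\infty \mathbb{P}\Bigl[ \max_{i=1,\dots,N} Z_{i,1} > u \Bigr]\, du \\
&\le 2v + 2N \int_v^\infty \mathbb{P}[Z_{1,1} > u]\, du \qquad \text{(for any $v > 0$)} \\
&\le 2v + \frac{2N}{\eta} e^{-\eta v} \\
&= \frac{2(1 + \ln N)}{\eta} \qquad \text{(by choosing $v = \ln N / \eta$).}
\end{aligned}
\]

On the other hand, for each $z$ and $t$ the triangle inequality implies that

\[ \frac{f(z)}{f(z - \boldsymbol{\ell}_t)} = \exp\bigl( -\eta ( \|z\|_1 - \|z - \boldsymbol{\ell}_t\|_1 ) \bigr) \le \exp\bigl( \eta \|\boldsymbol{\ell}_t\|_1 \bigr) \le e^{\eta}, \]

where we used the fact that $\|\boldsymbol{\ell}_t\|_1 \le 1$ in the converted prediction problem. This proves the
first inequality. To derive the second, just note that $e^{\eta} \le 1 + (e - 1)\eta$ for $\eta \in [0, 1]$, and
substitute the given value of $\eta$ in the first inequality. □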


The follow-the-perturbed-leader forecaster with exponentially distributed perturbations 
has thus a performance comparable to that of the best weighted average predictors (see 
Section 2.4). If L* is not known in advance, the adaptive techniques described in Chapter 2 
may be used. 

It remains to prove Theorem 4.2. 


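Proof of Theorem 4.2. The proof is based on an adaptation of the arguments of Section
3.2. First we investigate a related "forecaster" defined by

\[ \hat{I}_t = \operatorname*{argmin}_{i=1,\dots,N} \bigl( L_{i,t} + Z_{i,t} \bigr) = \operatorname*{argmin}_{i=1,\dots,N} \sum_{s=1}^t \bigl( \ell(i, y_s) + Z_{i,s} - Z_{i,s-1} \bigr), \]

where we define $Z_{i,0} = 0$ for all $i$. Observe that $\hat{I}_t$ is not a legal forecaster because it uses
the unknown values of the losses at time $t$. However, we show in what follows that $I_t$ and
$\hat{I}_t$ have a similar behavior. To bound the performance of $\hat{I}_t$ we apply Lemma 3.1 for the
losses $\ell(i, y_s) + Z_{i,s} - Z_{i,s-1}$. We obtain

\[
\begin{aligned}
\sum_{t=1}^n \bigl( \ell(\hat{I}_t, y_t) + Z_{\hat{I}_t,t} - Z_{\hat{I}_t,t-1} \bigr)
&\le \min_{i=1,\dots,N} \sum_{t=1}^n \bigl( \ell(i, y_t) + Z_{i,t} - Z_{i,t-1} \bigr) \\
&= \min_{i=1,\dots,N} \Bigl( \sum_{t=1}^n \ell(i, y_t) + Z_{i,n} \Bigr) \\
&\le \sum_{t=1}^n \ell(i^*, y_t) + Z_{i^*,n} = \min_{i=1,\dots,N} L_{i,n} + Z_{i^*,n},
\end{aligned}
\]

where $i^*$ is the index of the overall best action. Rearranging, we obtain

\[
\begin{aligned}
\sum_{t=1}^n \ell(\hat{I}_t, y_t)
&\le \min_{i=1,\dots,N} L_{i,n} + Z_{i^*,n} + \sum_{t=1}^n \bigl( Z_{\hat{I}_t,t-1} - Z_{\hat{I}_t,t} \bigr) \\
&\le \min_{i=1,\dots,N} L_{i,n} + \max_{i=1,\dots,N} Z_{i,n} + \sum_{t=1}^n \max_{i=1,\dots,N} \bigl( Z_{i,t-1} - Z_{i,t} \bigr).
\end{aligned}
\]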


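Our next aim is to bound the expected value

\[ \mathbb{E} \sum_{t=1}^n \ell(\hat{I}_t, y_t) = \sum_{t=1}^n \mathbb{E}\, \ell(\hat{I}_t, y_t). \]

The key observation is that, because the opponent is oblivious, for each $t$, the value
$\mathbb{E}\, \ell(\hat{I}_t, y_t)$ only depends on the vector $Z_t = (Z_{1,t}, \dots, Z_{N,t})$ but not on $Z_s$, $s \neq t$. Therefore,
$\sum_{t=1}^n \mathbb{E}\, \ell(\hat{I}_t, y_t)$ remains unchanged if the $Z_{i,t}$ are replaced by $Z'_{i,t}$ in the definition of $\hat{I}_t$ as
long as the marginal distribution of the vector $Z'_t = (Z'_{1,t}, \dots, Z'_{N,t})$ remains the same as
that of $Z_t$. In particular, unlike $Z_1, \dots, Z_n$, the new vectors $Z'_1, \dots, Z'_n$ do not need to be
independent. The convenient choice is to take them all equal so that $Z'_1 = Z'_2 = \cdots = Z'_n$.
Then the inequality derived above implies that

\[
\begin{aligned}
\mathbb{E} \sum_{t=1}^n \ell(\hat{I}_t, y_t)
&\le \mathbb{E} \Bigl[ \min_{i=1,\dots,N} L_{i,n} + \max_{i=1,\dots,N} Z'_{i,n} + \sum_{t=1}^n \max_{i=1,\dots,N} \bigl( Z'_{i,t-1} - Z'_{i,t} \bigr) \Bigr] \\
&= \min_{i=1,\dots,N} L_{i,n} + \mathbb{E} \max_{i=1,\dots,N} Z_{i,1} + \mathbb{E} \max_{i=1,\dots,N} (-Z_{i,1}).
\end{aligned}
\]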


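The last step is to relate $\mathbb{E} \sum_{t=1}^n \ell(\hat{I}_t, y_t)$ to the expected loss $\mathbb{E} \sum_{t=1}^n \ell(I_t, y_t)$ of the
follow-the-perturbed-leader forecaster. To this end, just observe that

\[ \mathbb{E}\, \ell(\hat{I}_t, y_t) = \int F_t(z + \boldsymbol{\ell}_t) f(z)\, dz = \int F_t(z) f(z - \boldsymbol{\ell}_t)\, dz. \]

Thus,

\[ \mathbb{E} \sum_{t=1}^n \ell(I_t, y_t) - \mathbb{E} \sum_{t=1}^n \ell(\hat{I}_t, y_t) = \sum_{t=1}^n \int F_t(z) \bigl( f(z) - f(z - \boldsymbol{\ell}_t) \bigr)\, dz \]

and the first inequality follows. To obtain the second, simply observe that

\[
\begin{aligned}
\mathbb{E}\, \ell(I_t, y_t) &= \int F_t(z) f(z)\, dz \\
&\le \sup_{z,t} \frac{f(z)}{f(z - \boldsymbol{\ell}_t)} \int F_t(z) f(z - \boldsymbol{\ell}_t)\, dz \\
&= \sup_{z,t} \frac{f(z)}{f(z - \boldsymbol{\ell}_t)}\, \mathbb{E}\, \ell(\hat{I}_t, y_t). \qquad \square
\end{aligned}
\]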


4.4 Internal Regret 


In this section we design forecasters able to minimize a notion of regret, which we call 
“internal,” strictly stronger than the regret for randomized prediction analyzed in this 
chapter. A game-theoretic motivation for the study of internal regret is offered in Section 7.4. 
Recall the following randomized prediction problem: at each time instant $t$ the forecaster
(or player) determines a probability distribution $p_t = (p_{1,t}, \dots, p_{N,t})$ over the set of $N$
possible actions and chooses an action randomly according to this distribution. In this section,
for simplicity, we focus our attention on the expected loss $\bar{\ell}(p_t, Y_t) = \sum_{i=1}^N \ell(i, Y_t)\, p_{i,t}$
and not on the actual random loss to which we return in Section 7.4.

Up to this point we have always compared the (expected) cumulative loss of the forecaster 
to the cumulative loss of each action, and we have investigated prediction schemes, such as 
the exponentially weighted average forecaster, guaranteeing that the forecaster’s cumulative 
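expected loss is not much larger than the cumulative loss of the best action. The difference

\[ \sum_{t=1}^n \bar{\ell}(p_t, Y_t) - \min_{i=1,\dots,N} \sum_{t=1}^n \ell(i, Y_t) = \max_{i=1,\dots,N} \sum_{t=1}^n \sum_{j=1}^N p_{j,t} \bigl( \ell(j, Y_t) - \ell(i, Y_t) \bigr) \]
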

is sometimes called the external regret of the forecaster. Next we investigate a closely 
related but different notion of regret. 

Roughly speaking, a forecaster has a small internal regret if, for each pair of experts $(i, j)$, the forecaster does not regret not having followed expert $j$ each time he followed expert $i$.

The internal cumulative regret of a forecaster $\mathbf{p}_t$ is defined by

$$\max_{i,j=1,\ldots,N} R_{(i,j),n} = \max_{i,j=1,\ldots,N} \sum_{t=1}^{n} r_{(i,j),t} = \max_{i,j=1,\ldots,N} \sum_{t=1}^{n} p_{i,t}\bigl(\ell(i, Y_t) - \ell(j, Y_t)\bigr).$$

Thus, $r_{(i,j),t} = p_{i,t}\bigl(\ell(i, Y_t) - \ell(j, Y_t)\bigr) = \mathbb{E}_t\bigl[\mathbb{I}_{\{I_t = i\}}\bigl(\ell(I_t, Y_t) - \ell(j, Y_t)\bigr)\bigr]$ expresses the forecaster's expected regret of having taken action $i$ instead of action $j$. Equivalently, $r_{(i,j),t}$ is the forecaster's regret of having put the probability mass $p_{i,t}$ on the $i$th expert instead of on the $j$th one. Now, clearly, the external regret of the forecaster $\mathbf{p}_t$ equals

$$\max_{j=1,\ldots,N} \sum_{i=1}^{N} R_{(i,j),n} \le N \max_{i,j=1,\ldots,N} R_{(i,j),n},$$

which shows that any algorithm with a small (i.e., sublinear in $n$) internal regret also has a small external regret. On the other hand, it is easy to see that a small external regret does not imply small internal regret. In fact, as is shown in the following example, even the exponentially weighted average algorithm defined in Section 4.2 may have a linearly growing internal regret.
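
The relation between the two notions can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names `internal_regret_matrix` and `external_regret_vector`, the zero-based action indices, and the random test data are illustrative assumptions rather than anything prescribed by the text. It computes the matrix of internal regrets $R_{(i,j),n}$ and the vector of external regrets, and the column sums of the former should reproduce the latter.

```python
import numpy as np

def internal_regret_matrix(p, losses):
    """R[i, j] = sum_t p[t, i] * (losses[t, i] - losses[t, j])."""
    # p, losses: arrays of shape (n, N); each row of p is a probability vector.
    own = (p * losses).sum(axis=0)      # own[i] = sum_t p_{i,t} * l(i, y_t)
    cross = p.T @ losses                # cross[i, j] = sum_t p_{i,t} * l(j, y_t)
    return own[:, None] - cross

def external_regret_vector(p, losses):
    """R[j] = cumulative expected loss of the forecaster minus loss of action j."""
    expected = (p * losses).sum(axis=1)
    return expected.sum() - losses.sum(axis=0)

# Hypothetical data: random losses in [0, 1] and random distributions.
rng = np.random.default_rng(0)
n, N = 200, 4
losses = rng.random((n, N))
p = rng.dirichlet(np.ones(N), size=n)

R_int = internal_regret_matrix(p, losses)
R_ext = external_regret_vector(p, losses)
```

Because the rows of `p` sum to one, `R_int.sum(axis=0)` and `R_ext` agree, which is the identity behind the inequality relating the two regrets.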


Example 4.1 (Weighted average forecaster has a large internal regret). Consider the following example with three actions A, B, and C. Let $n$ be a large multiple of 3, and assume that time is divided in three equally long regimes characterized by a constant loss for each action. These losses are summarized in Table 4.1. We claim that the regret $R_{(B,C),n}$ of B versus C grows linearly with $n$, that is,

$$\liminf_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} p_{B,t}\bigl(\ell(B, y_t) - \ell(C, y_t)\bigr) = \gamma > 0,$$

where $p_{B,t}$ denotes the weight assigned by the exponentially weighted average forecaster to action B:

$$p_{B,t} = \frac{e^{-\eta L_{B,t-1}}}{e^{-\eta L_{A,t-1}} + e^{-\eta L_{B,t-1}} + e^{-\eta L_{C,t-1}}},$$

where $L_{i,t} = \sum_{s=1}^{t} \ell(i, y_s)$ denotes the cumulative loss of action $i$ and $\eta$ is chosen as usual, that is,

$$\eta = \frac{1}{6}\sqrt{\frac{8 \ln 3}{n}} = \frac{K}{\sqrt{n}},$$

with $K = \sqrt{8 \ln 3}/6$. The intuition behind this example is that at the end of the second regime the forecaster quickly switches from A to B, and the weight of action C can never recover because of its disastrous behavior in the first two regimes. But since action C behaves much better than B in the third regime, the weighted average forecaster will regret not having chosen C each time it chose B. Figure 4.1 illustrates the behavior of the weight of action B.

Table 4.1. Example of a problem in which the exponentially weighted average forecaster has a large internal regret.

Regimes                   ℓ(A, y_t)   ℓ(B, y_t)   ℓ(C, y_t)
1 ≤ t ≤ n/3                   0           1           5
n/3 + 1 ≤ t ≤ 2n/3            1           0           5
2n/3 + 1 ≤ t ≤ n              1           0          −1


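[Figure 4.1: plot of the weight of B (vertical axis) over the first, second, and third regimes.]
Figure 4.1. The evolution of the weight assigned to B in Example 4.1 for n = 10000.

More precisely, we show that during the first two regimes the number of times when $p_{B,t}$ is more than $\varepsilon$ is of the order of $\sqrt{n}$ and that, in the third regime, $p_{B,t}$ is always more than a fixed constant (1/3, say). In the first regime, a sufficient condition for $p_{B,t} \le \varepsilon$ is $e^{-\eta L_{B,t-1}} \le \varepsilon$. This occurs whenever $t \ge t_0 = (-\ln \varepsilon)\sqrt{n}/K$. For the second regime, we lower bound the time instant $t_1$ when $p_{B,t}$ gets larger than $\varepsilon$. To this end, note that $p_{B,t} \ge \varepsilon$ is equivalent to

$$(1 - \varepsilon)\, e^{-\eta L_{B,t-1}} \ge \varepsilon \bigl(e^{-\eta L_{A,t-1}} + e^{-\eta L_{C,t-1}}\bigr).$$

Thus, $p_{B,t} \ge \varepsilon$ implies that $(1 - \varepsilon)\, e^{-\eta L_{B,t-1}} \ge \varepsilon\, e^{-\eta L_{A,t-1}}$, which leads to

$$t_1 \ge \frac{2n}{3} - \frac{\sqrt{n}}{K} \ln \frac{1 - \varepsilon}{\varepsilon}.$$

Finally, in the third regime, we have at each time instant $L_{B,t-1} \le L_{A,t-1}$ and $L_{B,t-1} \le L_{C,t-1}$, so that $p_{B,t} \ge 1/3$. Putting these three steps together, we obtain the following lower bound for the internal regret of B versus C:

$$\begin{aligned}
\sum_{t=1}^{n} p_{B,t}\bigl(\ell(B, y_t) - \ell(C, y_t)\bigr)
&= \sum_{t=1}^{n/3} p_{B,t}\bigl(\ell(B, y_t) - \ell(C, y_t)\bigr) + \sum_{t=n/3+1}^{2n/3} p_{B,t}\bigl(\ell(B, y_t) - \ell(C, y_t)\bigr) \\
&\quad + \sum_{t=2n/3+1}^{n} p_{B,t}\bigl(\ell(B, y_t) - \ell(C, y_t)\bigr) \\
&\ge -4\left(\frac{\sqrt{n}}{K} \ln\frac{1}{\varepsilon} + \frac{n}{3}\,\varepsilon\right) - 5\left(\frac{n}{3}\,\varepsilon + \frac{\sqrt{n}}{K} \ln\frac{1 - \varepsilon}{\varepsilon}\right) + \frac{n}{9} \\
&= \frac{n}{9} - 3n\varepsilon - 4\,\frac{\sqrt{n}}{K} \ln\frac{1}{\varepsilon} - 5\,\frac{\sqrt{n}}{K} \ln\frac{1 - \varepsilon}{\varepsilon},
\end{aligned}$$

which is about $n/9$ if $\varepsilon$ is of the order of $n^{-1/2}$.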
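
The behavior described in Example 4.1 can be reproduced numerically. The sketch below (Python with NumPy) is illustrative only: the helper names `example_losses` and `exponential_weights` and the horizon `n = 30000` are assumptions made for this illustration. It runs the exponentially weighted average forecaster on the three-regime losses and compares the per-round regret of B versus C with the per-round external regret.

```python
import numpy as np

def example_losses(n):
    """Three-regime loss table; rows are rounds, columns are actions A, B, C."""
    regimes = np.array([[0.0, 1.0, 5.0],
                        [1.0, 0.0, 5.0],
                        [1.0, 0.0, -1.0]])
    return np.repeat(regimes, n // 3, axis=0)

def exponential_weights(losses, eta):
    """Row t holds the weights proportional to exp(-eta * L_{i,t-1})."""
    n, N = losses.shape
    cum = np.vstack([np.zeros(N), np.cumsum(losses, axis=0)[:-1]])
    z = -eta * cum
    z -= z.max(axis=1, keepdims=True)   # stabilise the exponentials
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

n = 30000                               # a multiple of 3 (assumed horizon)
eta = np.sqrt(8 * np.log(3) / n) / 6    # eta = K / sqrt(n), losses have range 6
losses = example_losses(n)
p = exponential_weights(losses, eta)

# Internal regret of B versus C, and external regret against the best action.
regret_B_vs_C = np.sum(p[:, 1] * (losses[:, 1] - losses[:, 2]))
external_regret = np.sum(p * losses) - losses.sum(axis=0).min()
print(regret_B_vs_C / n, external_regret / n)
```

The external regret per round is small, as the usual bound for this choice of $\eta$ guarantees, while the regret of B versus C stays a constant fraction of $n$.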


This example shows that special algorithms need to be designed to guarantee a small internal regret. To construct such a forecasting strategy, we first recall the exponential potential function $\Phi : \mathbb{R}^M \to \mathbb{R}$ defined, for all $\eta > 0$, by

$$\Phi(\mathbf{u}) = \frac{1}{\eta} \ln\left(\sum_{i=1}^{M} e^{\eta u_i}\right),$$

where we set $M = N(N - 1)$. Hence, by the results of Section 2.1, writing $\mathbf{r}_t$ for the $M$-vector with components $r_{(i,j),t}$ and setting $\mathbf{R}_t = \sum_{s=1}^{t} \mathbf{r}_s$, we find that any forecaster satisfying Blackwell's condition $\nabla\Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t \le 0$, for all $t \ge 1$, also satisfies

$$\max_{i,j=1,\ldots,N} R_{(i,j),t} \le \frac{\ln N(N-1)}{\eta} + \frac{\eta}{2}\, t B^2,$$

where $B$ is an upper bound on $|r_{(i,j),s}|$ over all pairs $(i,j)$ and all rounds $s \le t$. By choosing $\eta$ optimally, one then obtains an internal regret bounded by $2B\sqrt{n \ln N}$ (see Corollary 2.2).

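To design a forecaster satisfying the Blackwell condition, observe that the following simple exchange of the order of summation yields

$$\begin{aligned}
\nabla\Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t
&= \sum_{i,j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1})\, p_{i,t}\bigl(\ell(i, Y_t) - \ell(j, Y_t)\bigr) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1})\, p_{i,t}\, \ell(i, Y_t) - \sum_{i=1}^{N}\sum_{j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1})\, p_{i,t}\, \ell(j, Y_t) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1})\, p_{i,t}\, \ell(i, Y_t) - \sum_{j=1}^{N}\sum_{i=1}^{N} \nabla_{(j,i)}\Phi(\mathbf{R}_{t-1})\, p_{j,t}\, \ell(i, Y_t) \\
&= \sum_{i=1}^{N} \ell(i, Y_t) \left( \sum_{j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1})\, p_{i,t} - \sum_{k=1}^{N} \nabla_{(k,i)}\Phi(\mathbf{R}_{t-1})\, p_{k,t} \right).
\end{aligned}$$

To guarantee that this quantity is nonpositive, it suffices to require that

$$p_{i,t} \sum_{j=1}^{N} \nabla_{(i,j)}\Phi(\mathbf{R}_{t-1}) - \sum_{k=1}^{N} \nabla_{(k,i)}\Phi(\mathbf{R}_{t-1})\, p_{k,t} = 0$$

for all $i = 1, \ldots, N$. The existence of such a vector $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$ may be proven by noting that $\mathbf{p}_t$ satisfies $\mathbf{p}_t^{\top} A = 0$, where $A$ is an $N \times N$ matrix whose entries are

$$A_{k,i} = \begin{cases} -\nabla_{(k,i)}\Phi(\mathbf{R}_{t-1}) & \text{if } i \ne k, \\ \sum_{j \ne i} \nabla_{(k,j)}\Phi(\mathbf{R}_{t-1}) & \text{otherwise.} \end{cases}$$

To this end, first note that $\nabla_{(i,j)}\Phi(\mathbf{R}_{t-1}) \ge 0$ for all $i, j$. Let $a_{\max} = \max_{i,j} |A_{i,j}|$ and let $I$ be the $N \times N$ identity matrix. Then $I - A/a_{\max}$ is a nonnegative row-stochastic matrix, and the Perron–Frobenius theorem (see, e.g., Seneta [263]) implies that a probability vector $\mathbf{q}$ exists such that $\mathbf{q}^{\top}(I - A/a_{\max}) = \mathbf{q}^{\top}$.

As such a $\mathbf{q}$ also satisfies $\mathbf{q}^{\top} A = \mathbf{0}^{\top}$, the choice $\mathbf{p}_t = \mathbf{q}$ leads to a forecaster satisfying the Blackwell condition. Foster and Vohra [107] suggest a Gaussian elimination method for the practical computation of $\mathbf{p}_t$.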
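
One round of this construction might be sketched as follows in Python with NumPy. This is not the Gaussian elimination method of Foster and Vohra; as a swapped-in technique, it simply solves the linear system $\mathbf{p}^{\top} A = 0$, $\sum_i p_i = 1$ by least squares. The function name `blackwell_distribution`, the zero-based indices, and the random regrets are assumptions made for illustration.

```python
import numpy as np

def blackwell_distribution(G):
    """Return (p, A) with p^T A = 0, where A is built from nonnegative weights G.

    G[i, j] plays the role of the gradient component for the pair (i, j);
    the diagonal of G is ignored.
    """
    N = G.shape[0]
    G = G.copy()
    np.fill_diagonal(G, 0.0)
    A = -G                                  # A[k, i] = -G[k, i] for i != k
    np.fill_diagonal(A, G.sum(axis=1))      # A[k, k] = sum_{j != k} G[k, j]
    # Stack the equations p^T A = 0 with the normalisation sum(p) = 1.
    system = np.vstack([A.T, np.ones((1, N))])
    rhs = np.zeros(N + 1)
    rhs[-1] = 1.0
    p, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    p = np.clip(p, 0.0, None)               # guard against tiny negative round-off
    return p / p.sum(), A

rng = np.random.default_rng(1)
N, eta = 5, 0.3
R = rng.normal(size=(N, N))                 # hypothetical cumulative regrets
G = np.exp(eta * (R - R.max()))             # proportional to the exponential-potential gradient
p, A = blackwell_distribution(G)

loss = rng.random(N)                        # an arbitrary loss vector for the round
r = p[:, None] * (loss[:, None] - loss[None, :])   # r[i, j] = p_i * (l_i - l_j)
off_diagonal = ~np.eye(N, dtype=bool)
blackwell_value = np.sum(G[off_diagonal] * r[off_diagonal])
```

With `p` chosen this way the inner product of the gradient with the regret vector vanishes for every loss vector, which is the Blackwell condition with equality.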

We remark that instead of the exponential potential defined above, other potential functions may also be used with success. For example, polynomial potentials of the form

$$\Phi(\mathbf{u}) = \left(\sum_{i=1}^{N} (u_i)_+^{p}\right)^{2/p}$$

for $p \ge 2$ lead to the bound $\max_{i,j=1,\ldots,N} R_{(i,j),t} \le B\sqrt{2(p-1)\,t\,N^{4/p}}$.


From Small External Regret to Small Internal Regret 

We close this section by describing a simple way of converting any external-regret-minimizing (i.e., Hannan consistent) forecaster into a strategy to minimize the internal regret. Such a method may be defined recursively as follows. At time $t = 1$, let $\mathbf{p}_1 = (1/N, \ldots, 1/N)$ be the uniform distribution over $N$ actions. Consider now $t > 1$ and assume that at time $t - 1$ the forecaster chose an action according to the distribution $\mathbf{p}_{t-1} = (p_{1,t-1}, \ldots, p_{N,t-1})$. For each pair $(i, j)$ of actions ($i \ne j$) define the probability distribution $\mathbf{p}_{t-1}^{i\to j}$ by the vector whose components are the same as those of $\mathbf{p}_{t-1}$, except that the $i$th component of $\mathbf{p}_{t-1}^{i\to j}$ equals zero and the $j$th component equals $p_{i,t-1} + p_{j,t-1}$. Thus, $\mathbf{p}_{t-1}^{i\to j}$ is obtained from $\mathbf{p}_{t-1}$ by transporting the probability mass from $i$ to $j$. We call these the $i \to j$ modified strategies. Consider now any external-regret-minimizing strategy that uses the $i \to j$ modified strategies as experts. More precisely, let $\Delta_t$ be a probability distribution over the pairs $i \ne j$ that has the property that its cumulative expected loss is almost as small as that of the best modified strategy, that is,

$$\frac{1}{n}\sum_{t=1}^{n}\sum_{i \ne j} \bar{\ell}(\mathbf{p}_t^{i\to j}, Y_t)\, \Delta_{(i,j),t} \le \min_{i \ne j} \frac{1}{n}\sum_{t=1}^{n} \bar{\ell}(\mathbf{p}_t^{i\to j}, Y_t) + \varepsilon_n$$

for some $\varepsilon_n \to 0$. For example, the exponentially weighted average strategy

$$\Delta_{(i,j),t} = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \bar{\ell}(\mathbf{p}_s^{i\to j}, Y_s)\right)}{\sum_{(k,l):\, k \ne l} \exp\left(-\eta \sum_{s=1}^{t-1} \bar{\ell}(\mathbf{p}_s^{k\to l}, Y_s)\right)}$$

guarantees that $\varepsilon_n = \sqrt{\ln(N(N-1))/(2n)} \le \sqrt{(\ln N)/n}$ by Theorem 2.2 and for a properly chosen value of $\eta$.

Given such a “meta-forecaster,” we define the forecaster $\mathbf{p}_t$ by the fixed point equality

$$\mathbf{p}_t = \sum_{(i,j):\, i \ne j} \mathbf{p}_t^{i\to j}\, \Delta_{(i,j),t}. \qquad (4.2)$$

The existence of a solution of the defining equation may be seen by an argument similar to the one we used to prove the existence of the potential-based internal-regret-minimizing forecaster defined earlier in this section. It follows from the defining equality that for each $t$,

$$\bar{\ell}(\mathbf{p}_t, Y_t) = \sum_{i \ne j} \bar{\ell}(\mathbf{p}_t^{i\to j}, Y_t)\, \Delta_{(i,j),t},$$

and therefore the cumulative loss of the forecaster $\mathbf{p}_t$ is bounded by

$$\frac{1}{n}\sum_{t=1}^{n} \bar{\ell}(\mathbf{p}_t, Y_t) \le \min_{i \ne j} \frac{1}{n}\sum_{t=1}^{n} \bar{\ell}(\mathbf{p}_t^{i\to j}, Y_t) + \varepsilon_n$$

or, equivalently,

$$\frac{1}{n} \max_{i \ne j} R_{(i,j),n} \le \varepsilon_n.$$
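
The conversion can be sketched end to end as follows (Python with NumPy). Everything named here is an assumption made for illustration: `fixed_point`, `modified_strategy`, and `internal_regret_forecaster` are hypothetical helpers, actions are zero-based, the losses are random numbers in $[0, 1]$, and the fixed point is found by least squares. The sketch uses the observation that the loss of the $i \to j$ modified strategy differs from that of $\mathbf{p}_t$ by exactly $r_{(i,j),t}$, so the exponential weights of the meta-forecaster are proportional to $\exp(\eta R_{(i,j),t-1})$.

```python
import numpy as np

def fixed_point(delta):
    """Solve p = sum_{i != j} delta[i, j] * p^{i -> j} for a probability vector p."""
    N = delta.shape[0]
    A = -delta.copy()
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(A, -A.sum(axis=1))     # A[k, k] = sum_{j != k} delta[k, j]
    system = np.vstack([A.T, np.ones((1, N))])
    rhs = np.zeros(N + 1)
    rhs[-1] = 1.0
    p, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    p = np.clip(p, 0.0, None)
    return p / p.sum()

def modified_strategy(p, i, j):
    """Return p^{i -> j}: move the mass of action i onto action j."""
    q = p.copy()
    q[j] += q[i]
    q[i] = 0.0
    return q

def internal_regret_forecaster(losses):
    """Run the meta-forecaster on losses of shape (n, N) with values in [0, 1]."""
    n, N = losses.shape
    eta = np.sqrt(8 * np.log(N * (N - 1)) / n)
    off = ~np.eye(N, dtype=bool)
    R = np.zeros((N, N))                    # R[i, j] = R_{(i,j),t-1}
    plays = np.empty((n, N))
    for t in range(n):
        # Exponential weights over the modified strategies (diagonal excluded).
        w = np.where(off, np.exp(eta * (R - R[off].max())), 0.0)
        p = fixed_point(w / w.sum())
        plays[t] = p
        R += p[:, None] * (losses[t][:, None] - losses[t][None, :])
    return plays, R

rng = np.random.default_rng(2)
n, N = 2000, 4
losses = rng.random((n, N))                 # an arbitrary test sequence
plays, R = internal_regret_forecaster(losses)
avg_internal_regret = R[~np.eye(N, dtype=bool)].max() / n
eps_n = np.sqrt(np.log(N * (N - 1)) / (2 * n))
```

Comparing `avg_internal_regret` with `eps_n` gives a direct check of the bound displayed above for this particular loss sequence.
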
Observe that if the meta-forecaster $\Delta_t$ is a potential-based weighted average forecaster, then the derived forecaster $\mathbf{p}_t$ and the one introduced earlier in this section coincide (just note that the latter forecaster satisfies Blackwell's condition). The bound obtained here improves the earlier one by a factor of 2.


4.5 Calibration


In the practice of forecasting binary sequences it is common to form the prediction in terms of percentages. For example, weather forecasters often publish their forecasts in statements like “the chance of rain tomorrow is 30%.” The quality of such forecasts may be assessed in various ways. In Chapters 8 and 9 we define and study different notions of quality of such sequential probability assignments. Here we are merely concerned with a simple notion known as calibration. While predictions of binary sequences in terms of “chances” or “probabilities” may not be obvious to interpret, a forecaster is definitely not doing his job well if, in the long run, the proportion of all rainy days for which the chance of rain was predicted to be 30% does not turn out to be about 30%. In such cases we say that the predictor is not well calibrated.

One may formalize the setup and the notion of calibration as follows. We consider a sequence of binary-valued outcomes $y_1, y_2, \ldots \in \{0, 1\}$. Assume that, at each time instant, the forecaster selects a decision $q_t$ from the interval $[0, 1]$. The value of $q_t$ may be interpreted as the prediction of the “probability” that $y_t = 1$. In this section, instead of defining a “loss” of predicting $q_t$ when the outcome is $y_t$, we only require that, for any fixed $x \in [0, 1]$, among the time instances when $q_t$ is close to $x$, the proportion of times with $y_t = 1$ should be approximately $x$. To quantify this notion, for all $\varepsilon > 0$ define $\rho_n^{\varepsilon}(x)$ to be the average of outcomes $y_t$ at the times $t = 1, \ldots, n$ when a value $q_t$ close to $x$ was predicted. Formally,

$$\rho_n^{\varepsilon}(x) = \frac{\sum_{t=1}^{n} y_t\, \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}}}{\sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}}}.$$

If $\sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}} = 0$, we say $\rho_n^{\varepsilon}(x) = 0$.

The forecaster is said to be $\varepsilon$-calibrated if, for all $x \in [0, 1]$ for which

$$\limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}} > 0,$$

we have

$$\limsup_{n\to\infty} \bigl|\rho_n^{\varepsilon}(x) - x\bigr| \le \varepsilon.$$

A forecaster is well calibrated if it is $\varepsilon$-calibrated for all $\varepsilon > 0$.

For some sequences it is very easy to construct well-calibrated forecasters. For example, if the outcome sequence happens to be stationary in the sense that the limit $\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} y_t$ exists, then the forecaster

$$q_t = \frac{1}{t-1}\sum_{s=1}^{t-1} y_s$$

is well calibrated. (The proof is left as an easy exercise.)
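
The quantity $\rho_n^{\varepsilon}(x)$ is straightforward to compute from a record of forecasts and outcomes. The sketch below (Python with NumPy) is a minimal illustration under stated assumptions: the names `rho` and `running_average_forecaster` are hypothetical, the forecaster predicts the average of the past outcomes with the arbitrary convention $q_1 = 1/2$, and the outcome sequence is the periodic pattern 0, 1, 1 chosen only because its limiting frequency $2/3$ is known.

```python
import numpy as np

def rho(q, y, x, eps):
    """Average of the outcomes y_t over the rounds with q_t in (x - eps, x + eps)."""
    active = (q > x - eps) & (q < x + eps)
    return y[active].mean() if active.any() else 0.0

def running_average_forecaster(y):
    """q_t = mean of y_1, ..., y_{t-1}, with the convention q_1 = 1/2."""
    past_sums = np.concatenate([[0.0], np.cumsum(y)[:-1]])
    counts = np.arange(len(y))
    q = np.full(len(y), 0.5)
    q[1:] = past_sums[1:] / counts[1:]
    return q

y = np.tile([0.0, 1.0, 1.0], 1000)     # outcomes with limiting frequency 2/3
q = running_average_forecaster(y)
value = rho(q, y, x=2 / 3, eps=0.05)
```

For this sequence the forecasts settle near $2/3$, so `value` ends up within $\varepsilon$ of $x = 2/3$; for an $x$ whose neighborhood is never visited, `rho` returns 0 by the convention in the definition.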

However, for arbitrary sequences of outcomes, constructing a calibrated forecaster is 
not a trivial issue. In fact, it is not difficult to see that no deterministic forecaster can 
be calibrated for all possible sequences of outcomes (see Exercise 4.9). Interestingly, 
however, if the forecaster is allowed to randomize, well-calibrated prediction is possible. 
Of the several known possibilities the one we describe next is based on the minimization 
of the internal regret of a suitably defined associated prediction problem. The existence of 
internal regret minimizing strategies, proved in the previous section, will then immediately 
imply the existence of well-calibrated forecasters for all possible sequences. 

In what follows, we assume that the forecaster is allowed to randomize. We adopt the model described at the beginning of this chapter. That is, at each time instance $t$, before making his prediction $q_t$, the forecaster has access to a random variable $U_t$, where $U_1, U_2, \ldots$ is a sequence of independent random variables uniformly distributed in $[0, 1]$. These values remain hidden from the opponent who sets the outcomes $y_t \in \{0, 1\}$, where $y_t$ may depend on the predictions of the forecaster up to time $t - 1$.


Remark 4.3. Once one allows randomized forecasters, it has to be clarified whether the opponent is oblivious or not. As noted earlier, we allow nonoblivious opponents. To keep the notation coherent, in this section we continue using lower-case $y_t$ to denote the outcomes, but we keep in mind that $y_t$ is a random variable, measurable with respect to $U_1, \ldots, U_{t-1}$.


The first step of showing the existence of a well-calibrated randomized forecaster is noting that it suffices to construct $\varepsilon$-calibrated forecasters. This observation may be proved by a simple application of a doubling trick whose details are left as an exercise.


Lemma 4.2. Suppose for each $\varepsilon > 0$ there is a forecaster that is $\varepsilon$-calibrated for all possible sequences of outcomes. Then a well-calibrated forecaster can be constructed.


To construct an $\varepsilon$-calibrated forecaster for a fixed positive $\varepsilon$, it is clearly sufficient to consider strategies whose predictions are restricted to a sufficiently fine grid of the unit interval $[0, 1]$. In the sequel we consider forecasters whose prediction takes the form $q_t = I_t/N$, where $N$ is a fixed positive integer and $I_t$ takes its values from the set $\{0, 1, \ldots, N\}$. At each time instance $t$, the forecaster determines a probability distribution $\mathbf{p}_t = (p_{0,t}, \ldots, p_{N,t})$ and selects prediction $I_t$ randomly according to this distribution. The quality of such a predictor may be assessed by means of the discretized version of the function $\rho_n$ defined as

$$\rho_n(i/N) = \frac{\sum_{t=1}^{n} y_t\, \mathbb{I}_{\{I_t = i\}}}{\sum_{s=1}^{n} \mathbb{I}_{\{I_s = i\}}} \qquad \text{for } i = 0, 1, \ldots, N.$$

Define the quantity $C_n$, often called the Brier score, as

$$C_n = \sum_{i=0}^{N} \left(\rho_n(i/N) - \frac{i}{N}\right)^2 \left(\frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{I_t = i\}}\right).$$

The following simple lemma shows that it is enough to make sure that the Brier score remains small asymptotically. The routine proof is again left to the reader.
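
As a concrete reading of the definition of $C_n$, here is a small sketch in Python with NumPy. The function name `brier_score` and the two toy forecasters are illustrative assumptions; grid points that are never predicted contribute nothing because their frequency is zero.

```python
import numpy as np

def brier_score(I, y, N):
    """C_n = sum_i (rho_n(i/N) - i/N)^2 * (fraction of rounds with I_t = i)."""
    I = np.asarray(I)
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i in range(N + 1):
        chosen = I == i
        if chosen.any():
            rho_i = y[chosen].mean()          # rho_n(i/N)
            total += (rho_i - i / N) ** 2 * chosen.mean()
    return total

outcomes = [0, 1] * 50                        # alternating outcomes, frequency 1/2
score_half = brier_score([5] * 100, outcomes, N=10)   # always predicts 5/10
score_one = brier_score([10] * 100, outcomes, N=10)   # always predicts 10/10
```

The forecaster that always predicts $1/2$ matches the empirical frequency and attains a score of zero, while the one that always predicts $1$ is penalized by the squared gap $(1/2 - 1)^2$.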


Lemma 4.3. Let $\varepsilon > 0$ be fixed and assume that $(N + 1)/N^2 > 1/\varepsilon^2$. If a discretized forecaster is such that

$$\limsup_{n\to\infty} C_n \le \frac{N+1}{N^2} \qquad \text{almost surely,}$$

then the forecaster is almost surely $\varepsilon$-calibrated.


According to Lemmas 4.2 and 4.3, to prove the existence of a randomized forecaster that is well calibrated for all possible sequences of outcomes it suffices to exhibit, for each fixed $N$, a “discretized” forecaster $I_t$ such that $\limsup_{n\to\infty} C_n \le (N + 1)/N^2$ almost surely. Such a forecaster may be constructed by a simple application of the techniques shown in the previous section. To this end, define a loss function by

$$\ell(i, y_t) = \left(y_t - \frac{i}{N}\right)^2, \qquad i = 0, 1, \ldots, N.$$

Then, by the results of Section 4.4, one may define a forecaster whose randomizing distribution $\mathbf{p}_t = (p_{0,t}, \ldots, p_{N,t})$ is such that the average internal regret

$$\max_{i,j=0,1,\ldots,N} \frac{R_{(i,j),n}}{n} = \max_{i,j=0,1,\ldots,N} \frac{1}{n}\sum_{t=1}^{n} r_{(i,j),t} = \max_{i,j=0,1,\ldots,N} \frac{1}{n}\sum_{t=1}^{n} p_{i,t}\bigl(\ell(i, y_t) - \ell(j, y_t)\bigr)$$

converges to zero for all possible sequences of outcomes. The next result guarantees that any such predictor has the desired property.


Lemma 4.4. Consider the loss function defined above and assume that $\mathbf{p}_1, \mathbf{p}_2, \ldots$ is such that, for all sequences of outcomes,

$$\lim_{n\to\infty} \max_{i,j=0,1,\ldots,N} \frac{1}{n} R_{(i,j),n} = 0.$$

If the forecaster is such that $I_t$ is drawn, at each time instant, randomly according to the distribution $\mathbf{p}_t$, then

$$\limsup_{n\to\infty} C_n \le \frac{N+1}{N^2} \qquad \text{almost surely.}$$


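Proof. Define, for each $n = 1, 2, \ldots$ and $i = 0, 1, \ldots, N$,

$$\bar{\rho}_n(i/N) = \frac{\sum_{t=1}^{n} y_t\, p_{i,t}}{\sum_{s=1}^{n} p_{i,s}}$$

and also

$$\bar{C}_n = \sum_{i=0}^{N} \left(\bar{\rho}_n(i/N) - \frac{i}{N}\right)^2 \left(\frac{1}{n}\sum_{t=1}^{n} p_{i,t}\right).$$

By martingale convergence (e.g., by Lemma A.7 and the Borel–Cantelli lemma) we have

$$\lim_{n\to\infty} \left|\frac{1}{n}\sum_{t=1}^{n} y_t\, \mathbb{I}_{\{I_t = i\}} - \frac{1}{n}\sum_{t=1}^{n} y_t\, p_{i,t}\right| = 0 \qquad \text{almost surely}$$

and, similarly,

$$\lim_{n\to\infty} \left|\frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{I_t = i\}} - \frac{1}{n}\sum_{t=1}^{n} p_{i,t}\right| = 0 \qquad \text{almost surely.}$$

This implies that

$$\lim_{n\to\infty} \bigl|C_n - \bar{C}_n\bigr| = 0 \qquad \text{almost surely}$$

and therefore it suffices to prove that

$$\limsup_{n\to\infty} \bar{C}_n \le \frac{N+1}{N^2}.$$

To this end, observe that

$$R_{(i,j),n} = \sum_{t=1}^{n} p_{i,t}\left(\left(y_t - \frac{i}{N}\right)^2 - \left(y_t - \frac{j}{N}\right)^2\right) = \left(\sum_{t=1}^{n} p_{i,t}\right)\left(\left(\bar{\rho}_n(i/N) - \frac{i}{N}\right)^2 - \left(\bar{\rho}_n(i/N) - \frac{j}{N}\right)^2\right)$$

and therefore

$$\left(\sum_{t=1}^{n} p_{i,t}\right)\left(\bar{\rho}_n(i/N) - \frac{i}{N}\right)^2 = R_{(i,j),n} + \left(\sum_{t=1}^{n} p_{i,t}\right)\left(\bar{\rho}_n(i/N) - \frac{j}{N}\right)^2.$$

For any fixed $i = 0, 1, \ldots, N$, the quantity $R_{(i,j),n}$ is maximized for the value of $j$ minimizing $(\bar{\rho}_n(i/N) - j/N)^2$. Thus,

$$\left(\sum_{t=1}^{n} p_{i,t}\right)\left(\bar{\rho}_n(i/N) - \frac{i}{N}\right)^2 = \max_{j=0,1,\ldots,N} R_{(i,j),n} + \min_{j=0,1,\ldots,N} \left(\sum_{t=1}^{n} p_{i,t}\right)\left(\bar{\rho}_n(i/N) - \frac{j}{N}\right)^2 \le \max_{j=0,1,\ldots,N} R_{(i,j),n} + \frac{n}{N^2}.$$

Dividing by $n$, summing over $i = 0, 1, \ldots, N$, and using the assumption that the average internal regret converges to zero yields $\limsup_{n\to\infty} \bar{C}_n \le (N+1)/N^2$, which concludes the proof. $\blacksquare$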


In summary, if $N$ is sufficiently large, then any forecaster designed to keep the cumulative internal regret small, based on the loss function $\ell(i, y_t) = (y_t - i/N)^2$, is guaranteed to be $\varepsilon$-calibrated. Because internal-regret-minimizing algorithms are not completely straightforward (see Section 4.4), the resulting forecaster is somewhat complex.
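
The link between internal regret under the quadratic loss and calibration rests on an algebraic identity: a direct computation from the definitions gives $R_{(i,j),n} = \bigl(\sum_t p_{i,t}\bigr)\bigl((\bar{\rho}_n(i/N) - i/N)^2 - (\bar{\rho}_n(i/N) - j/N)^2\bigr)$, where $\bar{\rho}_n(i/N) = \sum_t y_t p_{i,t} / \sum_t p_{i,t}$. The following sketch (Python with NumPy) checks this identity on random data; the data, the variable names, and the constant $1/(4N^2)$ (which uses the fact that every point of $[0, 1]$ lies within $1/(2N)$ of the grid) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 500, 8
y = rng.integers(0, 2, size=n).astype(float)        # arbitrary binary outcomes
p = rng.dirichlet(np.ones(N + 1), size=n)           # p[t, i] = p_{i,t}, i = 0..N
grid = np.arange(N + 1) / N

loss = (y[:, None] - grid[None, :]) ** 2            # l(i, y_t) = (y_t - i/N)^2
R = (p * loss).sum(axis=0)[:, None] - p.T @ loss    # R[i, j] = R_{(i,j),n}

mass = p.sum(axis=0)                                # sum_t p_{i,t}
rho_bar = (p * y[:, None]).sum(axis=0) / mass       # averaged outcome for grid point i
identity_rhs = mass[:, None] * ((rho_bar - grid)[:, None] ** 2
                                - (rho_bar[:, None] - grid[None, :]) ** 2)

# Smoothed calibration score and the bound obtained by taking the best j for each i.
C_bar = np.sum((rho_bar - grid) ** 2 * mass / n)
bound = np.sum(R.max(axis=1)) / n + 1 / (4 * N ** 2)
```

Here `R` and `identity_rhs` coincide up to round-off, and `C_bar` is dominated by `bound`, so a small internal regret forces a small smoothed calibration score.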

Interestingly, the form of the quadratic loss function is important. For example, forecasters that keep the internal regret based on the “absolute loss” $\ell(i, y_t) = |y_t - i/N|$ small cannot be calibrated (see Exercise 4.14).

We remark here that the forecaster constructed above has some interesting extra features. In fact, apart from being $\varepsilon$-calibrated, it is also a good predictor in the sense that its cumulative squared loss is not much larger than that of the best constant predictor. Indeed, since external regret is at most $N$ times the internal regret $\max_{i,j=0,1,\ldots,N} R_{(i,j),n}$, we immediately find that the forecaster constructed in the proof of Lemma 4.4 satisfies

$$\frac{1}{n}\left(\sum_{t=1}^{n} \left(y_t - \frac{I_t}{N}\right)^2 - \min_{i=0,1,\ldots,N} \sum_{t=1}^{n} \left(y_t - \frac{i}{N}\right)^2\right) \to 0 \qquad \text{almost surely.}$$

(See, however, Exercise 4.13.)
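In fact, no matter what algorithm is used, calibrated forecasters must have a low excess squared error in the above sense. Just note that the proof of Lemma 4.4 reveals that, with $\bar{\rho}_n$ and $\bar{C}_n$ as defined there,

$$\frac{1}{n} R_{(i,j),n} \le \left(\frac{1}{n}\sum_{t=1}^{n} p_{i,t}\right)\left(\bar{\rho}_n(i/N) - \frac{i}{N}\right)^2 \le \bar{C}_n \qquad \text{for all } i, j = 0, 1, \ldots, N.$$

Thus, any forecaster that keeps $C_n$ small has a small internal regret (defined on the basis of the quadratic losses $\ell(i, y_t) = (y_t - i/N)^2$) and therefore has a small external regret as well.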

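We conclude by pointing out that all results of this section may be extended to forecasting nonbinary sequences. Assume that the outcomes $y_t$ take their values in the finite set $\mathcal{Y} = \{1, 2, \ldots, m\}$. Then at time $t$ the forecaster outputs a vector $\mathbf{q}_t \in \mathcal{D}$ in the probability simplex

$$\mathcal{D} = \left\{\mathbf{p} = (p_1, \ldots, p_m) : \sum_{j=1}^{m} p_j = 1,\ p_j \ge 0\right\} \subset \mathbb{R}^m.$$

Analogous to the case of binary outcomes, in this setup we may define, for all $\mathbf{x} \in \mathcal{D}$,

$$\rho_n^{\varepsilon}(\mathbf{x}) = \frac{\sum_{t=1}^{n} \mathbf{y}_t\, \mathbb{I}_{\{\mathbf{q}_t :\, \|\mathbf{x} - \mathbf{q}_t\| < \varepsilon\}}}{\sum_{t=1}^{n} \mathbb{I}_{\{\mathbf{q}_t :\, \|\mathbf{x} - \mathbf{q}_t\| < \varepsilon\}}}$$

(with $\rho_n^{\varepsilon}(\mathbf{x}) = 0$ if $\sum_{t=1}^{n} \mathbb{I}_{\{\mathbf{q}_t :\, \|\mathbf{x} - \mathbf{q}_t\| < \varepsilon\}} = 0$), where $\|\cdot\|$ denotes the euclidean distance and $\mathbf{y}_t$ denotes the $m$-vector $(\mathbb{I}_{\{y_t = 1\}}, \ldots, \mathbb{I}_{\{y_t = m\}})$.

The forecaster is now $\varepsilon$-calibrated if we have

$$\limsup_{n\to\infty} \bigl\|\rho_n^{\varepsilon}(\mathbf{x}) - \mathbf{x}\bigr\| \le \varepsilon$$

for all $\mathbf{x} \in \mathcal{D}$ with $\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{\mathbf{q}_t :\, \|\mathbf{x} - \mathbf{q}_t\| < \varepsilon\}} > 0$. The forecaster is well calibrated if it is $\varepsilon$-calibrated for all $\varepsilon > 0$. The procedure of constructing a well-calibrated forecaster may be extended in a straightforward way to this more general setup. The details are left to the reader.

Remark 4.4. We introduced the notion of $\varepsilon$-calibratedness by requiring

$$\limsup_{n\to\infty} \bigl|\rho_n^{\varepsilon}(x) - x\bigr| \le \varepsilon \qquad (4.3)$$

for all $x \in [0, 1]$ satisfying $\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}} > 0$. A simple inspection of the proof of the existence of well-calibrated forecasters reveals, in fact, the existence of a forecaster satisfying (4.3) for all $x \in [0, 1]$ with $\limsup_{n\to\infty} n^{-\alpha}\sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}} > 0$, where $\alpha$ is any number greater than 1/2. Some authors consider an even stronger definition by requiring (4.3) to hold for all $x$ for which $\sum_{t=1}^{n} \mathbb{I}_{\{q_t \in (x-\varepsilon,\, x+\varepsilon)\}}$ tends to infinity. See the bibliographical comments for references to works proving that calibration is possible under this stricter definition as well.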


4.6 Generalized Regret 


In this section we consider a more general notion of regret that encompasses internal and 
external regret encountered in previous sections and makes it possible to treat, in a unified 
framework, a variety of other notions of regret. 

We still work within the framework of randomized prediction in which the forecaster determines, at every time instant $t = 1, \ldots, n$, a probability distribution $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$ over $N$ actions and chooses an action $I_t$ according to this distribution. Similarly to the model introduced in Chapter 2, we compare the performance of the forecaster to the performance of the best expert in a given set of $m$ experts. However, as forecasters here use randomization, we allow the predictions of experts at each time $t$ to depend also on the forecaster's random action $I_t$. We use $f_{i,t}(I_t) \in \{1, \ldots, N\}$ to denote the action taken by expert $i$ at time $t$ when $I_t$ is the forecaster's action. In this model, the expert advice at time $t$ thus consists of $m$ vectors $(f_{i,t}(1), \ldots, f_{i,t}(N))$ for $i = 1, \ldots, m$.

For each expert $i = 1, \ldots, m$ and time $t$ we also define an activation function $A_{i,t} : \{1, \ldots, N\} \to \{0, 1\}$. The activation function determines whether the corresponding expert is active at the current prediction step. At each time instant $t$ the values $A_{i,t}(k)$ ($i = 1, \ldots, m$, $k = 1, \ldots, N$) of the activation function and the expert advice are revealed to the forecaster who then computes $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$. Define the (expected) generalized regret of a randomized forecaster with respect to expert $i$ at round $t$ by

$$r_{i,t} = \sum_{k=1}^{N} p_{k,t}\, A_{i,t}(k)\bigl(\ell(k, Y_t) - \ell(f_{i,t}(k), Y_t)\bigr) = \mathbb{E}_t\bigl[A_{i,t}(I_t)\bigl(\ell(I_t, Y_t) - \ell(f_{i,t}(I_t), Y_t)\bigr)\bigr].$$

Hence, the generalized regret with respect to expert $i$ is nonzero only if expert $i$ is active, and the expert is active based on the current step $t$ and, possibly, on the forecaster's guess $k$. In this model we allow that the functions $f_{i,t}$ and $A_{i,t}$ depend on the past random choices $I_1, \ldots, I_{t-1}$ of the forecaster (determined by a possibly nonoblivious adversary) and are revealed to the forecaster just before round $t$.

The following examples show that special cases of the generalized regret include external and internal regret and extend beyond these notions.

Example 4.2 (External regret). Taking $m = N$, $f_{i,t}(k) = i$ for all $k = 1, \ldots, N$, and letting $A_{i,t}(k)$ be the constant function 1, the generalized regret becomes

$$r_{i,t} = \sum_{k=1}^{N} p_{k,t}\bigl(\ell(k, Y_t) - \ell(i, Y_t)\bigr),$$

which is just the (external) regret introduced in Section 4.2.


Example 4.3 (Internal regret). To see how the notion of internal regret may be recovered as a special case, consider $m = N(N - 1)$ experts indexed by pairs $(i, j)$ for $i \ne j$. For each $t$ expert $(i, j)$ predicts as follows: $f_{(i,j),t}(k) = k$ if $k \ne i$ and $f_{(i,j),t}(i) = j$ otherwise. Thus, if $A_{(i,j),t} \equiv 1$, component $(i, j)$ of the generalized regret vector $\mathbf{r}_t \in \mathbb{R}^m$ becomes

$$r_{(i,j),t} = p_{i,t}\bigl(\ell(i, Y_t) - \ell(j, Y_t)\bigr),$$

which coincides with the definition of internal regret in Section 4.4.
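
The two special cases can be made concrete with a short sketch (Python with NumPy). The function name `generalized_regret`, the zero-based action indices, and the numerical values of `p` and `loss` are assumptions made for illustration; the expert advice and the activations are stored as arrays with one row per expert.

```python
import numpy as np

def generalized_regret(p, loss, f, A):
    """r[i] = sum_k p[k] * A[i, k] * (loss[k] - loss[f[i, k]]) for each expert i."""
    # p: (N,) distribution over actions; loss: (N,) losses of the actions this round;
    # f: (m, N) integer array, f[i, k] = action of expert i when the forecaster plays k;
    # A: (m, N) array of 0/1 activation values.
    return (A * p[None, :] * (loss[None, :] - loss[f])).sum(axis=1)

N = 3
p = np.array([0.5, 0.3, 0.2])
loss = np.array([0.9, 0.1, 0.4])

# External regret: expert i always plays action i and is always active.
f_ext = np.repeat(np.arange(N)[:, None], N, axis=1)
r_ext = generalized_regret(p, loss, f_ext, np.ones((N, N)))

# Internal regret: expert (i, j) copies the forecaster except that it plays j instead of i.
pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
f_int = np.tile(np.arange(N), (len(pairs), 1))
for row, (i, j) in enumerate(pairs):
    f_int[row, i] = j
r_int = generalized_regret(p, loss, f_int, np.ones((len(pairs), N)))
```

In the first case each component equals the expected loss minus the loss of a fixed action; in the second, the component for the pair $(i, j)$ reduces to $p_i(\ell_i - \ell_j)$.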


Example 4.4 (Swap regret). A natural extension of internal regret is obtained by letting the experts $f_1, \ldots, f_m$ compute all $N^N$ functions $\{1, \ldots, N\} \to \{1, \ldots, N\}$ defined on the set of actions. In other words, for each $i = 1, \ldots, m = N^N$ the sequence $(f_i(1), \ldots, f_i(N))$ takes its values in $\{1, \ldots, N\}$. Note that, in this case, the prediction of the experts depends on $t$ only through $I_t$. The resulting regret is called swap regret. It is easy to see (Exercise 4.8) that

$$\max_{\sigma \in \Sigma} R_{\sigma,n} \le N \max_{i \ne j} R_{(i,j),n},$$

where $\Sigma$ is the set of all functions $\{1, \ldots, N\} \to \{1, \ldots, N\}$ and $R_{\sigma,n}$ is the regret against the expert indexed by $\sigma$. Thus, any forecaster whose average internal regret converges to zero also has a vanishing swap regret. Note that the forecaster of Theorem 4.3, run with the exponential potential, has a swap regret bounded by $O(\sqrt{nN \ln N})$. This improves on the bound $O(N\sqrt{n \ln N})$, which we obtain by combining the bound of Exercise 4.8 with the regret bound for the exponentially weighted average forecaster of Section 4.2.
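
For a small number of actions the swap regret can be computed by brute force over all $N^N$ functions, which makes the comparison with internal regret easy to check. The sketch below (Python with NumPy and `itertools`) uses random data and zero-based actions as illustrative assumptions. It relies on the decomposition $R_{\sigma,n} = \sum_k R_{(k,\sigma(k)),n}$, with the convention $R_{(k,k),n} = 0$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, N = 50, 3
losses = rng.random((n, N))                   # arbitrary losses in [0, 1]
p = rng.dirichlet(np.ones(N), size=n)         # arbitrary forecaster distributions

# R[i, j] = sum_t p[t, i] * (losses[t, i] - losses[t, j]); the diagonal is zero.
R = (p * losses).sum(axis=0)[:, None] - p.T @ losses

def regret_against(sigma):
    """Regret against the expert that replaces each action k by sigma[k]."""
    return sum(R[k, sigma[k]] for k in range(N))

# Enumerate all N^N functions sigma: {0..N-1} -> {0..N-1}.
swap_regret = max(regret_against(s) for s in itertools.product(range(N), repeat=N))
per_action_best = R.max(axis=1).sum()
```

Since the best function can be chosen separately for each action, `swap_regret` equals the sum of the row maxima of `R`, and it is at most $N$ times the largest entry of `R` (the zero diagonal is included in that maximum here).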


This more general formulation permits us to consider an even wider family of prediction problems. For example, by defining the activation function $A_{i,t}(k)$ to depend on $t$ and $i$ but not on $k$, one may model “specialists,” that is, experts that occasionally abstain from making a prediction. In the next section we describe an application in which the activation functions play an important role. See also the bibliographic remarks for other variants of the problem.

The basic question now is whether it is possible to define a forecaster guaranteeing that the generalized regret $R_{i,n} = r_{i,1} + \cdots + r_{i,n}$ grows slower than $n$ for all $i = 1, \ldots, m$, regardless of the sequence of outcomes, the experts, and the activation functions. Let $\mathbf{R}_n = (R_{1,n}, \ldots, R_{m,n})$ denote the vector of regrets. By Theorem 2.1, it suffices to show the existence of a predictor $\mathbf{p}_t$ satisfying the Blackwell condition $\nabla\Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t \le 0$ for an appropriate potential function $\Phi$. The existence of such $\mathbf{p}_t$ is shown in the following result. For such a predictor we may then apply Theorem 2.1 and its corollaries to obtain performance bounds without further work.

Theorem 4.3. Fix a potential $\Phi$ with $\nabla\Phi \ge 0$. Then a randomized forecaster satisfying the Blackwell condition for the generalized regret is defined by the unique solution to the set of $N$ linear equations

$$p_{k,t} = \frac{\sum_{j=1}^{N} p_{j,t} \sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, A_{i,t}(j)\, \nabla_i\Phi(\mathbf{R}_{t-1})}{\sum_{i=1}^{m} A_{i,t}(k)\, \nabla_i\Phi(\mathbf{R}_{t-1})}, \qquad k = 1, \ldots, N.$$

Observe that in the special case of external regret (i.e., $m = N$, $f_{i,t}(k) = i$ for all $k = 1, \ldots, N$, and $A_{i,t} \equiv 1$), the predictor of Theorem 4.3 reduces to the usual weighted average forecaster of Section 4.2. Also, in the case of internal regret described in Example 4.3, the forecaster of Theorem 4.3 reduces to the forecaster studied in Section 4.4.


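Proof. The proof is a generalization of the argument we used for the internal regret in Section 4.4. We may rewrite the left-hand side of the Blackwell condition as follows:

$$\begin{aligned}
\nabla\Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t
&= \sum_{i=1}^{m} \nabla_i\Phi(\mathbf{R}_{t-1}) \sum_{k=1}^{N} p_{k,t}\, A_{i,t}(k)\bigl(\ell(k, Y_t) - \ell(f_{i,t}(k), Y_t)\bigr) \\
&= \sum_{k=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(k) = j\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{k,t}\, A_{i,t}(k)\bigl(\ell(k, Y_t) - \ell(j, Y_t)\bigr) \\
&= \sum_{k=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(k) = j\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{k,t}\, A_{i,t}(k)\, \ell(k, Y_t) \\
&\quad - \sum_{k=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(k) = j\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{k,t}\, A_{i,t}(k)\, \ell(j, Y_t) \\
&= \sum_{k=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(k) = j\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{k,t}\, A_{i,t}(k)\, \ell(k, Y_t) \\
&\quad - \sum_{k=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{j,t}\, A_{i,t}(j)\, \ell(k, Y_t) \\
&= \sum_{k=1}^{N} \ell(k, Y_t) \left[\sum_{i=1}^{m} \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{k,t}\, A_{i,t}(k) - \sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{j,t}\, A_{i,t}(j)\right].
\end{aligned}$$

Because the $\ell(k, Y_t)$ are arbitrary and nonnegative, the last expression is guaranteed to be nonpositive whenever

$$p_{k,t} \sum_{i=1}^{m} \nabla_i\Phi(\mathbf{R}_{t-1})\, A_{i,t}(k) - \sum_{j=1}^{N}\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, \nabla_i\Phi(\mathbf{R}_{t-1})\, p_{j,t}\, A_{i,t}(j) = 0$$

for each $k = 1, \ldots, N$. Solving for $p_{k,t}$ yields the result. It remains to check that such a predictor always exists. For clarity, let $c_{i,j} = \nabla_i\Phi(\mathbf{R}_{t-1})\, A_{i,t}(j)$, so that the above condition can be written as

$$p_{k,t} \sum_{i=1}^{m} c_{i,k} - \sum_{j=1}^{N} p_{j,t} \sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, c_{i,j} = 0 \qquad \text{for } k = 1, \ldots, N.$$

Now note that this condition is equivalent to $\mathbf{p}_t^{\top} H = 0$, where $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$ and $H$ is an $N \times N$ matrix whose entries are

$$H_{j,k} = \begin{cases} -\sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) = k\}}\, c_{i,j} & \text{if } k \ne j, \\ \sum_{i=1}^{m} \mathbb{I}_{\{f_{i,t}(j) \ne j\}}\, c_{i,j} & \text{otherwise.} \end{cases}$$

As in the argument in Section 4.4, let $h_{\max} = \max_{i,j} |H_{i,j}|$ and let $I$ be the $N \times N$ identity matrix. Since $\nabla\Phi \ge 0$, matrix $I - H/h_{\max}$ is nonnegative and row stochastic. The Perron–Frobenius theorem then implies that there exists a probability vector $\mathbf{q}$ such that $\mathbf{q}^{\top}(I - H/h_{\max}) = \mathbf{q}^{\top}$. Because such a $\mathbf{q}$ also satisfies $\mathbf{q}^{\top} H = \mathbf{0}^{\top}$, the choice $\mathbf{p}_t = \mathbf{q}$ leads to a forecaster satisfying the Blackwell condition. $\blacksquare$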
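
The matrix $H$ of the proof can be assembled directly from the expert advice, the activations, and the gradient of the potential. The sketch below (Python with NumPy) is one possible way to do this; the function name `generalized_forecaster`, the zero-based actions, the least-squares solve, and the example gradient values are assumptions made for illustration. The external-regret special case serves as a check, since there the solution should be the normalized gradient.

```python
import numpy as np

def generalized_forecaster(grad, f, A):
    """Return (p, H) with p^T H = 0 and p a probability vector.

    grad: (m,) nonnegative gradient components of the potential;
    f: (m, N) integer array, f[i, j] = action of expert i when the forecaster plays j;
    A: (m, N) array of 0/1 activation values.
    """
    m, N = f.shape
    c = grad[:, None] * A                    # c[i, j] = grad_i * A_i(j)
    H = np.zeros((N, N))
    for i in range(m):
        for j in range(N):
            k = f[i, j]
            if k != j:
                H[j, k] -= c[i, j]           # off-diagonal entry H[j, k]
                H[j, j] += c[i, j]           # diagonal collects experts with f_i(j) != j
    # Solve p^T H = 0 together with the normalisation sum(p) = 1.
    system = np.vstack([H.T, np.ones((1, N))])
    rhs = np.zeros(N + 1)
    rhs[-1] = 1.0
    p, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    p = np.clip(p, 0.0, None)
    return p / p.sum(), H

# External-regret special case: expert i always plays action i, all experts active.
N = 3
grad = np.array([1.0, 2.0, 5.0])             # hypothetical gradient values
f_ext = np.repeat(np.arange(N)[:, None], N, axis=1)
p, H = generalized_forecaster(grad, f_ext, np.ones((N, N)))
```

Each row of `H` sums to zero, which is what makes $I - H/h_{\max}$ row stochastic in the argument above.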


4.7 Calibration with Checking Rules 


The notion of calibration has some important weaknesses. While calibratedness is an important basic property one expects from a probability forecaster, a well-calibrated forecaster may have very little predictive power. To illustrate this, consider the sequence of outcomes $\{y_t\}$ formed by alternating 0's and 1's: 01010101.... The forecaster $q_t = 1/2$ is well calibrated but unsatisfactory, as any sensible forecaster should, intuitively, predict something close to 0 at time $t = 2 \times 10^6 + 1$ after having seen the pattern “01” repeated a million times.

A way of strengthening the definition of calibration is by introducing so-called checking rules. Consider the randomized forecasting problem described in Section 4.5. In this problem, at each time instance $t = 1, 2, \ldots$ the forecaster determines his randomized prediction $q_t$ and then observes the outcome $Y_t$. A checking rule $A$ assigns, to each time instance $t$, an indicator function $A_t(x) = \mathbb{I}_{\{x \in S_t\}}$ of a set $S_t \subset [0, 1]$ and reveals it to the forecaster before making a prediction $q_t$. The set $S_t$ may depend on the past sequence of outcomes and forecasts $q_1, Y_1, \ldots, q_{t-1}, Y_{t-1}$. (Formally, $A_t$ is measurable with respect to the filtration generated by the past forecasts and outcomes.) The role of checking rules is the same as that of the activation functions introduced in Section 4.6. Given a countable family $\{A^{(1)}, A^{(2)}, \ldots\}$ of checking rules, the goal of the forecaster now is to guarantee that, for all checking rules $A^{(k)}$ with $\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} A_t^{(k)}(q_t) > 0$,

$$\lim_{n\to\infty} \left|\frac{\sum_{t=1}^{n} Y_t\, A_t^{(k)}(q_t)}{\sum_{t=1}^{n} A_t^{(k)}(q_t)} - \frac{\sum_{t=1}^{n} q_t\, A_t^{(k)}(q_t)}{\sum_{t=1}^{n} A_t^{(k)}(q_t)}\right| = 0.$$

We call a forecaster satisfying this property well calibrated under checking rules.

Thus, any checking rule selects, on the basis of the past, a subset of $[0, 1]$, and we calculate the average of the forecasts only over those periods in which the checking rule was active at $q_t$. Before making the prediction, the forecaster is warned on what sets the checking rules are active at that period. The goal of the predictor is to force the average of his forecasts, computed over the periods in which a checking rule was active, to be close to the average of the outcomes over the same periods. By considering checking rules of the form $A_t(q) = \mathbb{I}_{\{q \in (x-\varepsilon,\, x+\varepsilon)\}}$ we recover the original notion of calibration introduced in Section 4.5. However, by introducing checking rules that depend on the time instance $t$ and/or on the past, one may obtain much more interesting notions of calibration. For example, one may include checking rules that are active only at even periods of time, others that are active at odd periods of time, others that are only active if the last three outcomes were 0, etc. One may even consider a checking rule that is only active when the average of past outcomes is far away from the average of the past forecasts, calculated over the time instances in which the rule was active in the past.
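
The effect of time-dependent checking rules on the alternating sequence can be seen in a few lines. The sketch below (Python with NumPy) is illustrative: the function name `checking_rule_gap` is hypothetical, and the three rules (always active, active at odd times, active at even times) are chosen only to mirror the examples mentioned above.

```python
import numpy as np

def checking_rule_gap(q, y, active):
    """|average outcome - average forecast| over the rounds where the rule is active."""
    active = np.asarray(active, dtype=bool)
    if not active.any():
        return 0.0
    return abs(y[active].mean() - q[active].mean())

n = 1000
t = np.arange(1, n + 1)
y = (t % 2 == 0).astype(float)       # outcomes 0, 1, 0, 1, ...
q = np.full(n, 0.5)                  # the constant forecaster q_t = 1/2

gap_always = checking_rule_gap(q, y, np.ones(n, dtype=bool))
gap_odd = checking_rule_gap(q, y, t % 2 == 1)
gap_even = checking_rule_gap(q, y, t % 2 == 0)
```

The constant forecaster passes the rule that is always active but misses by $1/2$ under each of the parity rules, which is the kind of weakness that checking rules are designed to expose.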

By a simple combination of Theorem 4.3 and the arguments of Section 4.5, it is now easy to establish the existence of well-calibrated forecasters under a wide class of checking rules. Just like in Section 4.5, the first step of the argument is a discretization. Fix an $\varepsilon > 0$ and consider first a division of the unit interval $[0, 1]$ into $N = \lfloor 1/\varepsilon \rfloor$ intervals of equal length. Consider checking rules $A$ such that, for all $t$ and histories $q_1, Y_1, \ldots, q_{t-1}, Y_{t-1}$, $A_t$ is an indicator function of a union of some of these $N$ intervals. Call a checking rule satisfying this property an $\varepsilon$-checking rule. The proof of the following result is left as an exercise.


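Theorem 4.4. Let $\varepsilon > 0$. For any countable family $\{A^{(1)}, A^{(2)}, \ldots\}$ of $\varepsilon$-checking rules, there exists a randomized forecaster such that, almost surely, for all $A^{(k)}$ with $\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} A_t^{(k)}(q_t) > 0$,

$$\limsup_{n\to\infty} \left|\frac{\sum_{t=1}^{n} Y_t\, A_t^{(k)}(q_t)}{\sum_{t=1}^{n} A_t^{(k)}(q_t)} - \frac{\sum_{t=1}^{n} q_t\, A_t^{(k)}(q_t)}{\sum_{t=1}^{n} A_t^{(k)}(q_t)}\right| \le \varepsilon.$$


4.8 Bibliographic Remarks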


Randomized forecasters have been considered in various different setups; see, for example, 
Feder, Merhav, and Gutman [95], Foster and Vohra [107], Cesa-Bianchi and Lugosi [51]. 
For surveys we refer to Foster and Vohra [107], Vovk [300], Merhav and Feder [214]. The 
first Hannan consistent forecasters were proposed by Hannan [141] and Blackwell [29]. 
Hannan’s original forecaster is based on the idea of “following the perturbed leader,” 
as described in Section 4.3. Blackwell’s procedure is, instead, in the same spirit as the
weighted average forecasters based on the quadratic potential. (This is described in detail 
in Section 7.5.) 

The analysis of Hannan’s forecaster presented in Section 4.3 is due to Kalai and Vem- 
pala [174]. Hutter and Poland [164] provide a refined analysis. Hart and Mas-Colell [146] 
characterize the whole class of potentials for which the Blackwell condition (4.1) yields a 
Hannan consistent player. 

The existence of strategies with small internal regret was first shown by Foster and 
Vohra [105]; see also Fudenberg and Levine [118, 121] and Hart and Mas-Colell [145, 146].
The forecaster described here is based on Hart and Mas-Colell [146] and is taken from
Cesa-Bianchi and Lugosi [54]. The example showing that simple weighted average predictors are
not sufficient appears in Stoltz and Lugosi [279]. The method of converting forecasters with
small external regret into ones with small internal regret described here was suggested
by Stoltz and Lugosi [279]; see also Blum and Mansour [34] for a related result. The
swap regret was introduced in [34], where a polynomial time algorithm is presented that
achieves a bound of the order of $\sqrt{nN \ln N}$ for the swap regret.

The notion of calibration was introduced in the weather forecasting literature by
Brier [42], after whom the Brier score is named. It was Dawid [80] who introduced
the problem to the literature of mathematical statistics; see also Dawid’s related work on
calibration and “prequential” statistics [81, 82, 84]. Oakes [226] and Dawid [83] pointed
out the impossibility of deterministic calibration. The fact that randomized calibrated fore- 
casters exist was proved by Foster and Vohra [106], and the construction of the forecaster 
of Section 4.5 is based on their paper. Foster [104], Hart (see [106]), and Fudenberg and 
Levine [120] also introduce calibrated forecasters. The constructions of Hart and Fuden- 
berg and Levine are based on von Neumann’s minimax theorem. The notion of calibration 
discussed in Section 4.5 has been strengthened by Lehrer [194], Sandroni, Smorodin- 
sky, and Vohra [259], and Vovk and Shafer [301]. These last references also consider 
checking rules and establish results in the spirit of the one presented in Section 4.7. 
These last two papers construct well-calibrated forecasters using even stronger notions of 
checking rules. Kakade and Foster [171] and Vovk, Takemura, and Shafer [302] indepen- 
dently define a weaker notion of calibration by requiring that for all nonnegative Lipschitz 
functions w, 


$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} w(q_t)\,(y_t - q_t) = 0.$$


They show the existence of a deterministic weakly calibrated forecaster and use this to 
define an “almost deterministic” calibrated forecaster. Calibration is also closely related 
to the notion of “merging” originated by Blackwell and Dubins [30]. Kalai, Lehrer, and 
Smorodinsky [177] show that, in fact, these two notions are equivalent in some sense; see
Kalai and Lehrer [176], Lehrer and Smorodinsky [196], and Sandroni and Smorodinsky [258]
for related results.

Generalized regret (Section 4.6) was introduced by Lehrer [195] (see also Cesa- 
Bianchi and Lugosi [54]). The “specialist” algorithm of Freund, Schapire, Singer, and 
Warmuth [115] is a special case of the forecaster for generalized regret defined in Theo- 
rem 4.3. The framework of specialists has been applied by Cohen and Singer [65] to a text 
categorization problem. 


4.9 Exercises 


4.1 Let $B_n : \mathcal{Y}^n \to [0, \infty]$ be a function that assigns a nonnegative number to any sequence
$y^n = (y_1, \ldots, y_n)$ of outcomes, and consider a randomized forecaster whose expected cumulative
loss satisfies

$$\sup_{y^n} \left( \mathbb{E} \sum_{t=1}^{n} \ell(I_t, y_t) - B_n(y^n) \right) \le 0.$$

Show that if the same forecaster is used against a nonoblivious opponent, then its expected
performance satisfies

$$\mathbb{E} \left[ \sum_{t=1}^{n} \ell(I_t, Y_t) - B_n(Y^n) \right] \le 0$$

(Hutter and Poland [164]). Hint: Show that when the forecaster is used against a nonoblivious
opponent, then

$$\sup_{y_1 \in \mathcal{Y}} \mathbb{E}_1 \sup_{y_2 \in \mathcal{Y}} \mathbb{E}_2 \cdots \sup_{y_n \in \mathcal{Y}} \mathbb{E}_n \left[ \sum_{t=1}^{n} \ell(I_t, y_t) - B_n(y^n) \right] \le 0.$$

Proceed as in Lemma 4.1.

4.2 Prove Corollary 4.3. First show that the difference between the cumulative loss $\sum_{t=1}^{n} \ell(I_t, Y_t)$
and its expected value $\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t)$ is, with overwhelming probability, not larger than
something of the order of $\sqrt{n}$. Hint: Use the Hoeffding-Azuma inequality and the Borel-Cantelli
lemma.

4.3 Investigate the same problem as in the previous exercise but now for the modification of the
randomized forecaster defined as

$$p_{i,t} = \frac{\phi'(R_{i,t-1})}{\sum_{j=1}^{N} \phi'(R_{j,t-1})},$$

where $R_{i,t} = \sum_{s=1}^{t} \bigl( \ell(I_s, Y_s) - \ell(i, Y_s) \bigr)$. Warning: There is an important difference here
between different choices of $\phi$. The choice $\phi(x) = e^{\eta x}$ is the easiest case.

4.4 Consider the randomized forecasters of Section 4.2. Prove that the cumulative loss $\overline{L}_n$ of the
exponentially weighted average forecaster is always at least as large as the cumulative loss
$\min_{i \le N} L_{i,n}$ of the best pure strategy. Hint: Reverse the proof of Theorem 2.2.

4.5 Prove Theorem 4.1. Hint: The first statement is straightforward after defining appropriately the
exponentially weighted average strategy. To prove the second statement, combine the first part
with the results of Section 2.8.

4.6 Show that the regret of the follow-the-leader algorithm (i.e., fictitious play) mentioned at the
beginning of Section 4.3 is bounded by the number of times the “leader” (i.e., the action with
minimal cumulative loss) is changed during the sequence of plays.

4.7 Consider the version of the follow-the-perturbed-leader forecaster in which the distribution of
$Z_t$ is uniform on the cube $[0, \Delta_t]^N$, where $\Delta_t = \sqrt{tN}$. Show that the expected cumulative loss
after $n$ rounds is bounded as

$$\overline{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le 2\sqrt{2nN}.$$

Conclude that the forecaster is Hannan consistent (Hannan [141], Kalai and Vempala [174]).

4.8 Show that the swap regret is bounded by $N$ times the internal regret.

4.9 Consider the setup of Section 4.5. Show that if $\varepsilon < 1/3$, then for any deterministic forecaster, that
is, for any sequence of functions $q_t : \{0, 1\}^{t-1} \to [0, 1]$, $t = 1, 2, \ldots$, there exists a sequence
of outcomes $y_1, y_2, \ldots$ such that the forecaster $q_t = q_t(y^{t-1})$ is not $\varepsilon$-calibrated (Oakes [226],
Dawid [83]).

4.10 Show that if $\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} y_t$ exists, then the deterministic forecaster

$$q_t = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s$$

is well calibrated.

4.11 Prove Lemma 4.2.

4.12 Prove Lemma 4.3.


4.13 This exercise points out that calibrated forecasters are necessarily bad in a certain sense. Show
that for any well-calibrated forecaster $q_t$ there exists a sequence of outcomes such that

$$\liminf_{n\to\infty} \frac{1}{n} \left( \sum_{t=1}^{n} |y_t - q_t| - \min\left\{ \sum_{t=1}^{n} |y_t - 0|, \ \sum_{t=1}^{n} |y_t - 1| \right\} \right) > 0.$$

4.14 Let $N > 0$ be fixed and consider any forecaster that keeps the internal regret based on the
absolute loss

$$\frac{1}{n} \max_{i,j} R_{(i,j),n} = \frac{1}{n} \max_{i,j} \sum_{t=1}^{n} p_{i,t} \bigl( |y_t - i/N| - |y_t - j/N| \bigr)$$

small (i.e., it converges to 0). Show that for any $\varepsilon$, if $N$ is sufficiently large, such a forecaster
cannot be $\varepsilon$-calibrated. Hint: Use the result of the previous exercise.

4.15 (Uniform calibration) Recall the notation of Section 4.5. A forecaster is said to be uniformly
well calibrated if for all $\varepsilon > 0$,

$$\limsup_{n\to\infty} \sup_{x \in [0,1]} \bigl| \rho_n^{\varepsilon}(x) - x \bigr| < \varepsilon.$$

Show that there exists a uniformly well-calibrated randomized forecaster.

4.16 (Uniform calibration continued) An even stronger notion of calibration is obtained if we
define, for all sets $A \subset [0, 1]$,

$$\rho_n(A) = \frac{\sum_{t=1}^{n} y_t\, \mathbb{I}_{\{q_t \in A\}}}{\sum_{s=1}^{n} \mathbb{I}_{\{q_s \in A\}}}.$$

Let $A_\varepsilon = \{x : \exists\, y \in A \text{ such that } |x - y| < \varepsilon\}$ be the $\varepsilon$-blowup of the set $A$ and let $\lambda$ denote the
Lebesgue measure. Show that there exists a forecaster such that, for all $\varepsilon > 0$,

$$\limsup_{n\to\infty} \sup_{A} \left| \rho_n(A_\varepsilon) - \frac{\int_{A_\varepsilon} x\, dx}{\lambda(A_\varepsilon)} \right| < \varepsilon,$$

where the supremum is taken over all subsets of $[0, 1]$. (Note that if the supremum is only taken
over all singletons $A = \{x\}$, then we recover the notion of uniform calibration defined in the
previous exercise.)

4.17 (Uniform calibration of non-binary sequences) We may extend the notion of uniform
calibration to the case of forecasting non-binary sequences described at the end of Section 4.5.
Here $y_t \in \mathcal{Y} = \{1, 2, \ldots, m\}$ and the forecaster outputs a vector $\mathbf{q}_t \in \mathcal{D}$. For any subset $A$ of
the probability simplex $\mathcal{D}$, define

$$\rho_n(A) = \frac{\sum_{t=1}^{n} \mathbf{y}_t\, \mathbb{I}_{\{\mathbf{q}_t \in A\}}}{\sum_{t=1}^{n} \mathbb{I}_{\{\mathbf{q}_t \in A\}}},$$

where $\mathbf{y}_t = (\mathbb{I}_{\{y_t = 1\}}, \ldots, \mathbb{I}_{\{y_t = m\}})$. Writing $A_\varepsilon = \{\mathbf{x} : \exists\, \mathbf{x}' \in A,\ \|\mathbf{x} - \mathbf{x}'\| < \varepsilon\}$ and $\lambda$ for the
uniform probability measure over $\mathcal{D}$, show that there exists a forecaster such that for all $A \subset \mathcal{D}$
and for all $\varepsilon > 0$,

$$\limsup_{n\to\infty} \left\| \rho_n(A_\varepsilon) - \frac{\int_{A_\varepsilon} \mathbf{x}\, d\mathbf{x}}{\lambda(A_\varepsilon)} \right\| < \varepsilon.$$

Prove the uniform version of this result by showing the existence of a forecaster such that, for
all $\varepsilon > 0$,

$$\limsup_{n\to\infty} \sup_{A} \left\| \rho_n(A_\varepsilon) - \frac{\int_{A_\varepsilon} \mathbf{x}\, d\mathbf{x}}{\lambda(A_\varepsilon)} \right\| < \varepsilon,$$

where the supremum is taken over all subsets of $\mathcal{D}$.
4.18 (Hannan consistency for discounted regret) Consider the problem of randomized prediction
described in Section 4.1. Let $\beta_0 \ge \beta_1 \ge \cdots$ be a sequence of positive discount factors. A
forecaster is said to satisfy discounted Hannan consistency if

$$\limsup_{n\to\infty} \frac{\sum_{t=1}^{n} \beta_{n-t}\, \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \beta_{n-t}\, \ell(i, Y_t)}{\sum_{t=1}^{n} \beta_{n-t}} = 0$$

with probability 1. Derive sufficient and necessary conditions for the sequences of discount
factors for which discounted Hannan consistency may be achieved.


4.19 Prove Theorem 4.4. Hint: First assume a finite class of $\varepsilon$-checking rules and show, by
combining the arguments of Section 4.5 with Theorem 4.3, that for any interval of the form
$(i/N, (i+1)/N]$,


n 
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lim sup 7 
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whenever lim sup,_,, 7 Dora L ,> 0. 
Efficient Forecasters for Large Classes of Experts


5.1 Introduction 


The results presented in Chapters 2, 3, and 4 show that it is possible to construct algorithms 
for online forecasting that predict an arbitrary sequence of outcomes almost as well as the 
best of $N$ experts. Namely, the per-round cumulative loss of the predictor is at most as large
as that of the best expert plus a term proportional to $\sqrt{(\ln N)/n}$ for any bounded loss function,
where $n$ is the number of rounds in the prediction game. The logarithmic dependence on
the number of experts makes it possible to obtain meaningful bounds even if the pool of 
experts is very large. However, the basic prediction algorithms, such as weighted average 
forecasters, have a computational complexity proportional to the number of experts, and 
they are therefore infeasible when the number of experts is very large. 

On the other hand, in many applications the set of experts has a certain structure that 
may be exploited to construct efficient prediction algorithms. Perhaps the best known such 
example is the problem of tracking the best expert, in which there is a small number of 
“base” experts and the goal of the forecaster is to predict as well as the best “compound” 
expert. This expert is defined by a sequence consisting of at most m + 1 blocks of base 
experts so that in each block the compound expert predicts according to a fixed base expert. 
If there are $N$ base experts and the length of the prediction game is $n$, then the total number
of compound experts is $\Theta\bigl(N (nN/m)^m\bigr)$, exponentially large in $m$. In Section 5.2 we develop
a forecasting algorithm able to track the best expert on any sequence of outcomes while
requiring only $O(N)$ computations in each time period.

Prototypical examples of structured classes of experts for which efficient algorithms 
have been constructed include classes that can be represented by discrete structures such 
as lists, trees, and paths in graphs. It turns out that many of the forecasters we analyze in 
this book can be efficiently calculated over large classes of such “combinatorial experts.” 
Computational efficiency is achieved because combinatorial experts are generated via 
manipulation of a simple base structure (i.e., a class might include experts associated with 
all sublists of a given list or all subtrees of a given tree), and for this reason their predictions 
are tightly related. 

Most of these algorithms are based on efficient implementations of the exponentially 
weighted average forecaster. We describe two important examples in Section 5.3, concerned 
with experts defined on binary trees, and in Section 5.4, devoted to experts defined by paths 
in a given graph. A different approach uses follow-the-perturbed-leader predictors (see 
Section 4.3) that may be used to obtain efficient algorithms for a large class of problems, 
including the shortest path problem. This application is described in Section 5.4. The
purpose of Section 5.5 is to develop efficient algorithms to track the best expert in the case
when the class of “base” experts is already very large and has some structure. This is, in a 
sense, a combination of the two types of problems described above. 

For simplicity, we analyze combinatorial experts in the framework of randomized pre- 
diction (see Chapter 4). Most results of this chapter (with the exception of the algorithms 
using the follow-the-perturbed-leader forecaster) can also be presented in the deterministic 
framework developed in Chapters 2 and 3. Following the model and convention introduced 
in Chapter 4, we sometimes use the term action instead of expert. The two are synonymous 
in the context of this chapter. 


5.2 Tracking the Best Expert 


In the basic model of randomized prediction the forecaster’s predictions are evaluated
against the best performing single action. This criterion is formalized by the usual notion
of regret

$$\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t),$$

which expresses the excess cumulative loss of the forecaster when compared with strategies
that use the same action all the time. In this section we are interested in comparing the
cumulative loss of the forecaster with more flexible strategies that are allowed to switch
actions a limited number of times. (Recall that in Section 4.2 we briefly addressed such
comparison classes of “dynamic” strategies.) This is clearly a significantly richer class,
because there may be many outcome sequences $Y_1, \ldots, Y_n$ such that

$$\min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t)$$

is large, but there is a partition of the sequence in blocks $Y_1^{t_1-1}, Y_{t_1}^{t_2-1}, \ldots, Y_{t_m}^{n}$ of consecutive
outcomes so that on each block $Y_{t_k}^{t_{k+1}-1} = (Y_{t_k}, \ldots, Y_{t_{k+1}-1})$ some action $i_k$ performs very
well.

Forecasting strategies that are able to “track” the sequence $i_1, i_2, \ldots, i_{m+1}$ of good
actions would then perform substantially better than any individual action. This motivates
the following generalization of regret. Fix a horizon $n$. Given any sequence $i_1, \ldots, i_n$ of
actions from $\{1, \ldots, N\}$, define the tracking regret by

$$R(i_1, \ldots, i_n) = \sum_{t=1}^{n} \ell(I_t, Y_t) - \sum_{t=1}^{n} \ell(i_t, Y_t),$$

where, as usual, $I_1, \ldots, I_n$ is the sequence of randomized actions drawn by the forecaster.
To simplify the arguments, throughout this chapter we study the “expected” regret

$$\overline{R}(i_1, \ldots, i_n) = \sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t) - \sum_{t=1}^{n} \ell(i_t, Y_t),$$

where, following the notation introduced in Chapter 4, $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$ denotes
the distribution according to which the random action $I_t$ is drawn at time $t$, and
$\overline{\ell}(\mathbf{p}_t, Y_t) = \sum_{i=1}^{N} p_{i,t}\, \ell(i, Y_t)$ is the expected loss of the forecaster at time $t$. Recall from
Section 4.1 that, with probability at least $1 - \delta$,

$$\sum_{t=1}^{n} \ell(I_t, Y_t) \le \sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t) + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}},$$

and even tighter bounds can be established if $\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t)$ is small.

Clearly, it is unreasonable to require that a forecaster perform well against the best
sequence of actions for any given sequence of outcomes. Ideally, the tracking regret should
scale with some measure of complexity that penalizes action sequences that are, in a certain
sense, harder to track. To this purpose, introduce

$$\operatorname{size}(i_1, \ldots, i_n) = \sum_{t=2}^{n} \mathbb{I}_{\{i_{t-1} \neq i_t\}},$$

counting how many switches $(i_t, i_{t+1})$ with $i_t \neq i_{t+1}$ occur in the sequence. Note that

$$\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t) - \min_{(i_1, \ldots, i_n)\,:\,\operatorname{size}(i_1, \ldots, i_n) = 0} \ \sum_{t=1}^{n} \ell(i_t, Y_t)$$

corresponds to the usual (nontracking) regret.

It is not difficult to modify the randomized forecasting strategies of Chapter 4 in order to
achieve a good tracking regret against any sequence of actions with a bounded number of
switches. We may simply associate a compound action with each action sequence $i_1, \ldots, i_n$
so that $\operatorname{size}(i_1, \ldots, i_n) \le m$ for some $m$ and fixed horizon $n$. We then run our randomized
forecaster over the set of compound actions: at any time $t$ the randomized forecaster draws a
compound action $(I_1, \ldots, I_n)$ and plays action $I_t$. Denote by $M$ the number of all compound
actions with size bounded by $m$. If we use the randomized exponentially weighted forecaster
over this set of all compound actions, then Corollary 4.2 implies that the tracking regret
is bounded by $\sqrt{(n \ln M)/2}$. Hence, it suffices to count the number of compound actions:
for each $k = 0, \ldots, m$ there are $\binom{n-1}{k}$ ways to pick $k$ time steps $t = 1, \ldots, n - 1$ where a
switch $i_t \neq i_{t+1}$ occurs, and there are $N(N-1)^k$ ways to assign a distinct action to each
of the $k + 1$ resulting blocks. This gives

$$M = \sum_{k=0}^{m} \binom{n-1}{k} N (N-1)^k \le N^{m+1} \exp\left( (n-1)\, H\!\left( \frac{m}{n-1} \right) \right),$$

where $H(x) = -x \ln x - (1 - x) \ln(1 - x)$ is the binary entropy function defined for $x \in [0, 1]$.
Substituting this bound in the above expression, we find that the tracking regret of
the randomized exponentially weighted forecaster for compound actions satisfies

$$\overline{R}(i_1, \ldots, i_n) \le \sqrt{ \frac{n}{2} \left( (m+1) \ln N + (n-1)\, H\!\left( \frac{m}{n-1} \right) \right) }$$

on any action sequence $i_1, \ldots, i_n$ such that $\operatorname{size}(i_1, \ldots, i_n) \le m$.
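The counting argument above is easy to check numerically. The following Python sketch is an illustration only: the function names are hypothetical, and the brute-force enumeration is feasible only for very small $N$ and $n$. It computes $M$ from the sum over $k$, compares it with a direct enumeration and with the entropy bound (checked here for $m \le (n-1)/2$, the range in which the standard estimate of the binomial sum applies), and evaluates the resulting regret bound.

```python
import itertools
import math


def binary_entropy(x):
    """H(x) = -x ln x - (1 - x) ln(1 - x), with the convention H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def count_compound_actions(N, n, m):
    """M = sum_{k=0}^{m} C(n-1, k) * N * (N-1)^k."""
    return sum(math.comb(n - 1, k) * N * (N - 1) ** k for k in range(m + 1))


def count_by_enumeration(N, n, m):
    """Brute-force count of length-n sequences with at most m switches."""
    total = 0
    for seq in itertools.product(range(N), repeat=n):
        switches = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
        total += switches <= m
    return total


def entropy_upper_bound(N, n, m):
    """N^(m+1) * exp((n-1) * H(m/(n-1)))."""
    return N ** (m + 1) * math.exp((n - 1) * binary_entropy(m / (n - 1)))


def tracking_regret_bound(N, n, m):
    """sqrt((n/2) * ((m+1) ln N + (n-1) H(m/(n-1))))."""
    complexity = (m + 1) * math.log(N) + (n - 1) * binary_entropy(m / (n - 1))
    return math.sqrt(0.5 * n * complexity)
```

For instance, `count_compound_actions(3, 5, 1)` and `count_by_enumeration(3, 5, 1)` agree, and both stay below `entropy_upper_bound(3, 5, 1)`.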


The Fixed Share Forecaster 

In its straightforward implementation, the exponentially weighted average forecaster
requires explicitly managing an exponential number of compound actions. We now show
how to efficiently implement a generalized version of this forecasting strategy that achieves
the same performance bound for the tracking regret. This efficient forecaster is derived
from a variant of the exponentially weighted forecaster where the initial weight distribution
is not uniform. The basis of this argument is the next simple general result. Consider the
randomized exponentially weighted average forecaster defined in Section 4.2, with the only
difference that the initial weights $w_{i,0}$ assigned to the $N$ actions are not necessarily uniform.
Via a straightforward combination of the proof of Theorem 2.2 (see also Exercise 2.5) with
the results of Section 4.2, we obtain the following result.


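Lemma 5.1. For all $n \ge 1$, if the randomized exponentially weighted forecaster is run using
initial weights $w_{1,0}, \ldots, w_{N,0} \ge 0$ such that $W_0 = w_{1,0} + \cdots + w_{N,0} \le 1$, then

$$\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t) \le \frac{1}{\eta} \ln \frac{1}{W_n} + \frac{\eta}{8} n,$$

where $W_n = \sum_{i=1}^{N} w_{i,n} = \sum_{i=1}^{N} w_{i,0}\, e^{-\eta \sum_{t=1}^{n} \ell(i, Y_t)}$ is the sum of the weights after $n$ rounds.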


Nonuniform initial weights may be interpreted to assign prior importance to the different 
actions. The weighted average forecaster gives more importance to actions with larger 
initial weight and is guaranteed to achieve a smaller regret with respect to these experts. 
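To make the role of the initial weights concrete, here is a minimal Python sketch of the exponentially weighted forecaster with a nonuniform prior, together with the right-hand side of Lemma 5.1. It is only an illustration under simplifying assumptions: the loss table is fixed in advance (an oblivious opponent), losses lie in $[0, 1]$, and the function names are hypothetical.

```python
import math


def expected_loss_with_prior(losses, prior, eta):
    """Exponentially weighted averaging started from the weights `prior`.

    losses[t][i] is the loss of action i in round t + 1, assumed to lie in [0, 1].
    Returns (sum of expected losses, W_n), where W_n is the final sum of weights.
    """
    w = list(prior)
    total_expected = 0.0
    for loss_t in losses:
        W = sum(w)
        # expected loss of the forecaster under p_t = w / W
        total_expected += sum(wi / W * l for wi, l in zip(w, loss_t))
        # multiplicative update of each weight
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, loss_t)]
    return total_expected, sum(w)


def lemma_bound(W_n, eta, n):
    """Right-hand side of Lemma 5.1: (1/eta) ln(1/W_n) + (eta/8) n."""
    return math.log(1.0 / W_n) / eta + eta * n / 8.0
```

Since $W_n \ge w_{i,0}\, e^{-\eta L_{i,n}}$ for every action $i$, an action with a larger initial weight yields a smaller bound on the regret with respect to that action, which is the interpretation given above.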

For the tracking application, we choose the initial weights of compound actions
$(i_1, \ldots, i_n)$ so that their values correspond to a probability distribution parameterized by
a real number $\alpha \in (0, 1)$. We show that our efficient forecaster achieves the same tracking
regret bound as the (nonefficient) exponentially weighted forecaster run with uniform
weights over all compound actions whose number of switches is bounded by a function
of $\alpha$.

We start by defining the initial weight assignment. Throughout the whole section we
assume that the horizon $n$ is fixed and known in advance. We write $w'_t(i_1, \ldots, i_n)$ to denote
the weight assigned at time $t$ by the exponentially weighted forecaster to the compound
action $(i_1, \ldots, i_n)$. For any fixed choice of the parameter $\alpha \in (0, 1)$, the initial weights of
the compound actions are defined by

$$w'_0(i_1, \ldots, i_n) = \frac{1}{N} \left( \frac{\alpha}{N} \right)^{\operatorname{size}(i_1, \ldots, i_n)} \left( 1 - \alpha + \frac{\alpha}{N} \right)^{n - \operatorname{size}(i_1, \ldots, i_n) - 1}.$$

Introducing the “marginalized” weights

$$w'_0(i_1, \ldots, i_t) = \sum_{i_{t+1}, \ldots, i_n} w'_0(i_1, \ldots, i_t, i_{t+1}, \ldots, i_n)$$

for all $t = 1, \ldots, n$, it is easy to see that the initial weights are recursively computed as
follows:

$$w'_0(i_1) = 1/N,$$

$$w'_0(i_1, \ldots, i_{t+1}) = w'_0(i_1, \ldots, i_t) \left( \frac{\alpha}{N} + (1 - \alpha)\, \mathbb{I}_{\{i_{t+1} = i_t\}} \right).$$

Note that, for any $n$, this assignment corresponds to a probability distribution over
$\{1, \ldots, N\}^n$, the set of all compound actions of length $n$. In particular, the ratio
$w'_0(i_1, \ldots, i_{t+1}) / w'_0(i_1, \ldots, i_t)$ can be viewed as the conditional probability that a
random compound action $(I_1, \ldots, I_n)$, drawn according to the distribution $w'_0$, has $I_{t+1} = i_{t+1}$
given that $I_t = i_t$. Hence, $w'_0$ is the joint distribution of a Markov process over the set
$\{1, \ldots, N\}$ such that $I_1$ is drawn uniformly at random, and each next action $I_{t+1}$ is equal
to the previous action $I_t$ with probability $1 - \alpha + \alpha/N$, and is equal to a different action
$j \neq I_t$ with probability $\alpha/N$. Thus, choosing $\alpha$ small amounts to assigning a small initial
weight to compound actions with a large number of switches.
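The two descriptions of the initial weights, the closed form and the Markov recursion, can be compared directly on small instances. The Python sketch below is an illustration with hypothetical function names; actions are indexed from 0, and the exhaustive enumeration is only practical for tiny $N$ and $n$.

```python
import itertools
import math


def size(seq):
    """Number of switches in the action sequence."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def initial_weight(seq, N, alpha):
    """Closed form: (1/N) (alpha/N)^size (1 - alpha + alpha/N)^(n - size - 1)."""
    n, s = len(seq), size(seq)
    return (1.0 / N) * (alpha / N) ** s * (1.0 - alpha + alpha / N) ** (n - s - 1)


def initial_weight_recursive(seq, N, alpha):
    """Markov recursion: start at 1/N, multiply by alpha/N + (1 - alpha) 1{next == prev}."""
    w = 1.0 / N
    for prev, nxt in zip(seq, seq[1:]):
        w *= alpha / N + (1.0 - alpha) * (nxt == prev)
    return w


def total_mass(N, n, alpha):
    """Sum of the initial weights over all N^n compound actions."""
    return sum(initial_weight(seq, N, alpha)
               for seq in itertools.product(range(N), repeat=n))
```

On such small cases the total mass equals 1 up to rounding, and the two weight functions agree on every sequence, in line with the Markov interpretation.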

At any time $t$, a generic weight of the exponentially weighted forecaster has the form

$$w'_t(i_1, \ldots, i_n) = w'_0(i_1, \ldots, i_n) \exp\left( -\eta \sum_{s=1}^{t} \ell(i_s, Y_s) \right)$$

and the forecaster draws action $i$ at time $t + 1$ with probability $w'_{i,t} / W'_t$, where
$W'_t = w'_{1,t} + \cdots + w'_{N,t}$ and

$$w'_{i,t} = \sum_{i_1, \ldots, i_t,\, i_{t+2}, \ldots, i_n} w'_t(i_1, \ldots, i_t, i, i_{t+2}, \ldots, i_n) \quad \text{for } t \ge 1 \qquad \text{and} \qquad w'_{i,0} = \frac{1}{N}.$$

We now define a general forecasting strategy for running efficiently the exponentially
weighted forecaster with this choice of initial weights.


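THE FIXED SHARE FORECASTER
Parameters: Real numbers $\eta > 0$ and $0 \le \alpha \le 1$.
Initialization: $\mathbf{w}_0 = (1/N, \ldots, 1/N)$.
For each round $t = 1, 2, \ldots$
(1) draw an action $I_t$ from $\{1, \ldots, N\}$ according to the distribution

$$p_{i,t} = \frac{w_{i,t-1}}{\sum_{j=1}^{N} w_{j,t-1}}, \qquad i = 1, \ldots, N;$$

(2) obtain $Y_t$ and compute

$$v_{i,t} = w_{i,t-1}\, e^{-\eta\, \ell(i, Y_t)} \quad \text{for each } i = 1, \ldots, N;$$

(3) let

$$w_{i,t} = \alpha \frac{W_t}{N} + (1 - \alpha)\, v_{i,t} \quad \text{for each } i = 1, \ldots, N,$$

where $W_t = v_{1,t} + \cdots + v_{N,t}$.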


Note that with $\alpha = 0$ the fixed share forecaster reduces to the simple exponentially weighted
average forecaster over $N$ base actions. A positive value of $\alpha$ forces the weights to stay
above a minimal level, which makes it possible to track the best compound action. We will see shortly
that if the goal is to track the best compound action with at most $m$ switches, then the right
choice of $\alpha$ is about $m/n$.
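The three steps of the fixed share forecaster translate almost line by line into code. The following Python sketch is a minimal illustration under stated assumptions: the losses are supplied as a table fixed in advance (so the opponent is oblivious), actions are indexed from 0, and the function name and the optional `rng` argument are hypothetical conveniences.

```python
import math
import random


def fixed_share(losses, eta, alpha, rng=None):
    """Run the fixed share forecaster on a table of losses.

    losses[t][i] is the loss of action i in round t + 1, assumed to lie in [0, 1].
    Returns (drawn actions, list of distributions p_t, final weights).
    """
    rng = rng or random.Random(0)
    N = len(losses[0])
    w = [1.0 / N] * N                      # initialization: w_0 = (1/N, ..., 1/N)
    actions, dists = [], []
    for loss_t in losses:
        total = sum(w)
        p = [wi / total for wi in w]       # step (1): p_{i,t} proportional to w_{i,t-1}
        actions.append(rng.choices(range(N), weights=p)[0])
        dists.append(p)
        # step (2): exponential loss update
        v = [wi * math.exp(-eta * l) for wi, l in zip(w, loss_t)]
        W = sum(v)
        # step (3): every action receives the fixed share alpha * W / N
        w = [alpha * W / N + (1.0 - alpha) * vi for vi in v]
    return actions, dists, w
```

Each round costs $O(N)$ operations. Step (3) preserves the total weight $W_t$ and guarantees $w_{i,t} \ge \alpha W_t / N$, which is the minimal level mentioned above.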

The following result shows that the fixed share forecaster is indeed an efficient version 
of the exponentially weighted forecaster. 


Theorem 5.1. For all $\alpha \in [0, 1]$, for any sequence of $n$ outcomes, and for all $t = 1, \ldots, n$,
the conditional (given the past) distribution of the action $I_t$, drawn at time $t$ by the fixed
share forecaster with input parameter $\alpha$, is the same as the conditional distribution of
action $I'_t$ drawn at time $t$ by the exponentially weighted forecaster run over the compound
actions $(i_1, \ldots, i_n)$ using initial weights $w'_0(i_1, \ldots, i_n)$ set with the same value of $\alpha$.


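Proof. It is enough to show that, for all $i$ and $t$, $w_{i,t} = w'_{i,t}$. We proceed by induction on
$t$. For $t = 0$, $w_{i,0} = w'_{i,0} = 1/N$ for all $i$. For the induction step, assume that $w_{i,s} = w'_{i,s}$
for all $i$ and $s < t$. We have

$$w'_{i,t} = \sum_{i_1, \ldots, i_t,\, i_{t+2}, \ldots, i_n} w'_t(i_1, \ldots, i_t, i, i_{t+2}, \ldots, i_n)$$

$$= \sum_{i_1, \ldots, i_t} \exp\left( -\eta \sum_{s=1}^{t} \ell(i_s, Y_s) \right) w'_0(i_1, \ldots, i_t)\, \frac{w'_0(i_1, \ldots, i_t, i)}{w'_0(i_1, \ldots, i_t)}$$

$$= \sum_{i_1, \ldots, i_t} \exp\left( -\eta \sum_{s=1}^{t} \ell(i_s, Y_s) \right) w'_0(i_1, \ldots, i_t) \left( \frac{\alpha}{N} + (1 - \alpha)\, \mathbb{I}_{\{i_t = i\}} \right)$$

(using the recursive definition of $w'_0$)

$$= \sum_{i_t} e^{-\eta\, \ell(i_t, Y_t)}\, w'_{i_t, t-1} \left( \frac{\alpha}{N} + (1 - \alpha)\, \mathbb{I}_{\{i_t = i\}} \right)$$

$$= \sum_{i_t} e^{-\eta\, \ell(i_t, Y_t)}\, w_{i_t, t-1} \left( \frac{\alpha}{N} + (1 - \alpha)\, \mathbb{I}_{\{i_t = i\}} \right)$$

(by the induction hypothesis)

$$= \sum_{i_t} v_{i_t, t} \left( \frac{\alpha}{N} + (1 - \alpha)\, \mathbb{I}_{\{i_t = i\}} \right) \qquad \text{(using step 2 of fixed share)}$$

$$= w_{i,t} \qquad \text{(using step 3 of fixed share).} \qquad \Box$$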


Note that $n$ does not appear in the proof, and the choice of $n$ is thus immaterial in the
statement of the theorem. Hence, unless $\alpha$ and $\eta$ are chosen in terms of $n$, the prediction
at time $t$ of the exponentially weighted forecaster can be computed without knowing the
length of the compound actions.
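The equivalence stated in Theorem 5.1 can also be observed numerically on tiny instances by enumerating all $N^n$ compound actions. The Python sketch below is only a sanity check under stated assumptions: the function names are hypothetical, actions and rounds are indexed from 0, and the losses are a fixed table. It computes the distributions used by fixed share and, separately, the marginals of the exponentially weighted forecaster run over compound actions with the Markov initial weights.

```python
import itertools
import math


def fixed_share_distributions(losses, eta, alpha):
    """Distributions p_1, ..., p_n produced by steps (1)-(3) of fixed share."""
    N = len(losses[0])
    w = [1.0 / N] * N
    dists = []
    for loss_t in losses:
        total = sum(w)
        dists.append([wi / total for wi in w])
        v = [wi * math.exp(-eta * l) for wi, l in zip(w, loss_t)]
        W = sum(v)
        w = [alpha * W / N + (1.0 - alpha) * vi for vi in v]
    return dists


def compound_distributions(losses, eta, alpha):
    """Same distributions, by brute force over all N^n compound actions."""
    n, N = len(losses), len(losses[0])
    dists = []
    for t in range(n):
        marginal = [0.0] * N
        for seq in itertools.product(range(N), repeat=n):
            # Markov initial weight of the compound action
            w0 = 1.0 / N
            for prev, nxt in zip(seq, seq[1:]):
                w0 *= alpha / N + (1.0 - alpha) * (nxt == prev)
            # cumulative loss of the compound action over the rounds before t
            past_loss = sum(losses[s][seq[s]] for s in range(t))
            marginal[seq[t]] += w0 * math.exp(-eta * past_loss)
        total = sum(marginal)
        dists.append([x / total for x in marginal])
    return dists
```

The brute-force version takes time exponential in $n$, whereas the fixed share recursion needs $O(N)$ operations per round. This gap is the computational point of the theorem.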

We are now ready to state the tracking regret bound for the fixed share forecaster. 


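Theorem 5.2. For all $n \ge 1$, the tracking regret of the fixed share forecaster satisfies

$$\overline{R}(i_1, \ldots, i_n) \le \frac{m+1}{\eta} \ln N + \frac{1}{\eta} \ln \frac{1}{\alpha^m (1 - \alpha)^{n-m-1}} + \frac{\eta}{8} n$$

for all action sequences $i_1, \ldots, i_n$, where $m = \operatorname{size}(i_1, \ldots, i_n)$.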
We emphasize that the bound of Theorem 5.2 is true for all sequences $i_1, \ldots, i_n$, and the
bound on the regret depends on $\operatorname{size}(i_1, \ldots, i_n)$, the complexity of the sequence. If the
objective is to minimize the tracking regret for all sequences with size bounded by $m$,
then the parameters $\alpha$ and $\eta$ can be tuned to minimize the right-hand side. This is shown
in the next result. We see that by tuning the parameters $\alpha$ and $\eta$ we obtain a tracking
regret bound exactly equal to the one proven for the exponentially weighted forecaster
run with uniform weights over all compound actions of complexity bounded by $m$. The
good choice of $\alpha$ turns out to be $m/(n - 1)$. Observe that with this choice the compound
actions with $m$ switches have the largest initial weight. In fact, it is easy to see that
the initial weight distribution is concentrated on the set of compound actions with about
$m$ switches. This intuitively explains why the next performance bound matches the one
obtained for the exponentially weighted average algorithm run over the full set of compound
actions.

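Corollary 5.1. For all $n, m$ such that $0 \le m < n$, if the fixed share forecaster is run with
parameters $\alpha = m/(n - 1)$, where for $m = 0$ we let $\alpha = 0$, and

$$\eta = \sqrt{ \frac{8}{n} \left( (m+1) \ln N + (n-1)\, H\!\left( \frac{m}{n-1} \right) \right) },$$

then

$$\overline{R}(i_1, \ldots, i_n) \le \sqrt{ \frac{n}{2} \left( (m+1) \ln N + (n-1)\, H\!\left( \frac{m}{n-1} \right) \right) }$$

for all action sequences $i_1, \ldots, i_n$ such that $\operatorname{size}(i_1, \ldots, i_n) \le m$.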


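Proof. First of all, note that for $\alpha = m/(n - 1)$

$$\ln \frac{1}{\alpha^m (1 - \alpha)^{n-m-1}} = -m \ln \frac{m}{n-1} - (n - m - 1) \ln \frac{n-m-1}{n-1} = (n-1)\, H\!\left( \frac{m}{n-1} \right).$$

Using our choice for $\eta$ in the bound of Theorem 5.2 concludes the proof. $\Box$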


In the special case $m = 0$, when the tracking regret reduces to the usual regret, the
bound of Corollary 5.1 is $\sqrt{(n/2) \ln N}$, which is the bound for the exponentially weighted
forecaster proven in Theorem 2.2.
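The tuning in Corollary 5.1 can be reproduced with a few lines of arithmetic. The Python sketch below, with hypothetical function names, evaluates the bound of Theorem 5.2 for given $\eta$ and $\alpha$, computes the tuned parameters, and evaluates the closed form of the corollary. Terms with a zero exponent are skipped so that the boundary cases $\alpha = 0$ and $\alpha = 1$ do not require evaluating $\ln 0$.

```python
import math


def binary_entropy(x):
    """H(x) = -x ln x - (1 - x) ln(1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def theorem_5_2_bound(N, n, m, eta, alpha):
    """((m+1)/eta) ln N + (1/eta) ln(1/(alpha^m (1-alpha)^(n-m-1))) + eta n / 8."""
    log_term = 0.0
    if m > 0:
        log_term -= m * math.log(alpha)
    if n - m - 1 > 0:
        log_term -= (n - m - 1) * math.log(1.0 - alpha)
    return ((m + 1) * math.log(N) + log_term) / eta + eta * n / 8.0


def tuned_parameters(N, n, m):
    """alpha = m/(n-1) and eta = sqrt((8/n) ((m+1) ln N + (n-1) H(alpha)))."""
    alpha = m / (n - 1) if m > 0 else 0.0
    complexity = (m + 1) * math.log(N) + (n - 1) * binary_entropy(alpha)
    return alpha, math.sqrt(8.0 * complexity / n)


def corollary_5_1_bound(N, n, m):
    """sqrt((n/2) ((m+1) ln N + (n-1) H(m/(n-1))))."""
    alpha = m / (n - 1) if m > 0 else 0.0
    complexity = (m + 1) * math.log(N) + (n - 1) * binary_entropy(alpha)
    return math.sqrt(0.5 * n * complexity)
```

Plugging the output of `tuned_parameters` into `theorem_5_2_bound` gives the same value as `corollary_5_1_bound`, because with $A = (m+1)\ln N + (n-1)H(m/(n-1))$ both $A/\eta$ and $\eta n/8$ equal $\sqrt{An/8}$ at the tuned $\eta$.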


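Proof of Theorem 5.2. Recall that for an arbitrary compound action $i_1, \ldots, i_n$ we have

$$\ln w'_n(i_1, \ldots, i_n) = \ln w'_0(i_1, \ldots, i_n) - \eta \sum_{t=1}^{n} \ell(i_t, Y_t).$$

By definition of $w'_0$, if $m = \operatorname{size}(i_1, \ldots, i_n)$,

$$w'_0(i_1, \ldots, i_n) = \frac{1}{N} \left( \frac{\alpha}{N} \right)^{m} \left( \frac{\alpha}{N} + (1 - \alpha) \right)^{n-m-1} \ge \frac{1}{N} \left( \frac{\alpha}{N} \right)^{m} (1 - \alpha)^{n-m-1}.$$

Therefore, using this in the bound of Lemma 5.1 we get, for any sequence $(i_1, \ldots, i_n)$ with
$\operatorname{size}(i_1, \ldots, i_n) = m$,

$$\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, Y_t) \le \frac{1}{\eta} \ln \frac{1}{W'_n} + \frac{\eta}{8} n$$

$$\le \frac{1}{\eta} \ln \frac{1}{w'_n(i_1, \ldots, i_n)} + \frac{\eta}{8} n$$

$$\le \sum_{t=1}^{n} \ell(i_t, Y_t) + \frac{1}{\eta} \ln N + \frac{m}{\eta} \ln \frac{N}{\alpha} - \frac{n-m-1}{\eta} \ln(1 - \alpha) + \frac{\eta}{8} n,$$

which concludes the proof. $\Box$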


Hannan Consistency 

A natural question is under what conditions Hannan consistency for the tracking regret can
be achieved. In particular, if we assume that at time $n$ the tracking regret is measured against
the best compound action with at most $\mu(n)$ switches, we may ask what is the fastest rate
of growth of $\mu(n)$ a Hannan consistent forecaster can tolerate. The following result shows
that a growth slightly slower than linear is already sufficient.


Corollary 5.2. Let $\mu : \mathbb{N} \to \mathbb{N}$ be any nondecreasing integer-valued function such that

$$\mu(n) = o\left( \frac{n}{\log(n) \log\log(n)} \right).$$

Then there exists a randomized forecaster such that

$$\limsup_{n\to\infty} \frac{1}{n} \left( \sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{\mathcal{F}_n} \sum_{t=1}^{n} \ell(i_t, Y_t) \right) = 0 \quad \text{with probability 1},$$

where $\mathcal{F}_n$ is the set of compound actions $(i_1, \ldots, i_n)$ whose size is at most $\mu(n)$.


The proof of Corollary 5.2 goes along the same lines as the proof of Corollary 6.1, and we 
leave it as an exercise. 


The Variable Share Forecaster 

Recall that if the best action has a small cumulative loss, then improved regret bounds 
may be achieved that involve the loss of the best action (see Section 2.4). Next we show 
how this can be done in the tracking framework such that the resulting forecaster is still 
computationally feasible. This is achieved by a modification of the fixed share forecaster. 
All we have to do is to change the initial weight assignment appropriately. 

A crucial feature of the choice of initial weights $w'_0$ when forecasting compound actions
is that for any $i$ and $t$, the weight $w'_{i,t}$ depends only on $w'_0(i_1, \ldots, i_t, i)$ and on the realized
losses up to time $t$. That is, for computing the prediction at time $t + 1$ there is no need to
know how $w'_0(i_1, \ldots, i_t, i)$ is split into $w'_0(i_1, \ldots, i_t, i, i_{t+2}, \ldots, i_n)$ for each continuation
$i_{t+2}, \ldots, i_n \in \{1, \ldots, N\}$. This fact, which is the key to the proof of Theorem 5.1, can
be exploited to define $w'_0(i_1, \ldots, i_t, i_{t+1})$ using information that is only made available at
time $t$. For example, consider the following recursive definition

$$w'_0(i_1, \ldots, i_{t+1}) = w'_0(i_1, \ldots, i_t) \left( \frac{1 - (1 - \alpha)^{\ell(i_t, Y_t)}}{N - 1}\, \mathbb{I}_{\{i_t \neq i_{t+1}\}} + (1 - \alpha)^{\ell(i_t, Y_t)}\, \mathbb{I}_{\{i_t = i_{t+1}\}} \right).$$


This is similar to the definition of the Markov process associated with the initial weight
distribution used by the fixed share forecaster. The difference is that here we see that, given
$I_t = i_t$, the probability of $I_{t+1} \neq i_t$ grows with $\ell(i_t, Y_t)$, the loss incurred by action $i_t$ at time
$t$. Hence, this new distribution assigns a further penalty to compound actions that switch to
a new action when the old one incurs a small loss.

On the basis of this new distribution, we introduce the variable share forecaster, replacing
step 3 of the fixed share forecaster with

$$w_{i,t} = \frac{1}{N - 1} \sum_{j \neq i} \left( 1 - (1 - \alpha)^{\ell(j, Y_t)} \right) v_{j,t} + (1 - \alpha)^{\ell(i, Y_t)}\, v_{i,t}.$$

Mimicking the proof of Theorem 5.1, it is easy to check that the weights $w_{i,t}$ of the variable
share forecaster satisfy

$$w_{i,t} = \sum_{i_1, \ldots, i_t,\, i_{t+2}, \ldots, i_n} w'_t(i_1, \ldots, i_t, i, i_{t+2}, \ldots, i_n).$$

Hence, the computation carried out by the variable share forecaster efficiently updates the
weights $w'_{i,t}$.

Note that this initial weight distribution assigns a negligible weight $w'_0$ to all compound
actions $(i_1, \ldots, i_n)$ such that $i_{t+1} \neq i_t$ and $\ell(i_t, Y_t)$ is close to 0 for some $t$. In Theorem 5.3 we
analyze the performance of the variable share forecaster under the simplifying assumption
of binary losses $\ell \in \{0, 1\}$. Via a similar, but somewhat more complicated, argument, a
regret bound slightly larger than the bound of Theorem 5.3 can be proven in the general
setup where $\ell \in [0, 1]$ (see Exercise 5.4).
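In code, the variable share update differs from fixed share only in step 3. The Python sketch below is a minimal illustration with a hypothetical function name; it assumes $N \ge 2$, losses in $[0, 1]$, and actions indexed from 0. Each action $j$ gives away the fraction $1 - (1-\alpha)^{\ell(j, Y_t)}$ of its updated weight $v_{j,t}$, and this amount is split equally among the other $N - 1$ actions.

```python
import math


def variable_share_step(w, loss_t, eta, alpha):
    """One round of the variable share weight update (steps 2 and 3).

    w       -- current weights w_{1,t-1}, ..., w_{N,t-1}
    loss_t  -- losses l(1, Y_t), ..., l(N, Y_t), each in [0, 1]
    Returns the new weights w_{1,t}, ..., w_{N,t}.
    """
    N = len(w)
    # step 2: exponential loss update
    v = [wi * math.exp(-eta * l) for wi, l in zip(w, loss_t)]
    # fraction of its weight that each action keeps: (1 - alpha)^loss
    keep = [(1.0 - alpha) ** l for l in loss_t]
    # weight given away by each action
    shared = [(1.0 - k) * vi for k, vi in zip(keep, v)]
    total_shared = sum(shared)
    # step 3: receive an equal part of what the *other* actions give away
    return [(total_shared - shared[i]) / (N - 1) + keep[i] * v[i] for i in range(N)]
```

The sum of the weights is unchanged by step 3, and an action with zero loss gives nothing away since $(1-\alpha)^0 = 1$. Computing the total shared weight once keeps the cost at $O(N)$ per round.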


Theorem 5.3. Fix a time horizon $n$. Under the assumption $\ell \in \{0, 1\}$, for all $\eta > 0$ and for
all $\alpha < (N - 1)/N$ the tracking regret of the variable share forecaster satisfies


Yep, YD- Y LG, Y) 
t=1 t=1 


+1 
m+" Tint tine + i (S46, ¥0) im * -+n 
n 


for all action sequences $i_1, \ldots, i_n$, where $m = \operatorname{size}(i_1, \ldots, i_n)$.


Observe that the performance bound guaranteed by the theorem is similar to that of 
Theorem 5.2 with the only exception that 


L TE T oe 5 ein YI ! 
n 1S replace cad lt, n 
fa p y 7 to {t T 


n = 


Because the inequality holds for all sequences with $\operatorname{size}(i_1, \ldots, i_n) \le m$, this is a significant
improvement if there exists a sequence $(i_1, \ldots, i_n)$ with at most $m$ switches, with $m$ not
too large, that has a small cumulative loss. Of course, if the goal of the forecaster is to
minimize the cumulative regret with respect to the class of all compound actions with
$\operatorname{size}(i_1, \ldots, i_n) \le m$, then the optimal choice of $\alpha$ and $\eta$ depends on the smallest such
cumulative loss of the compound actions. In the absence of prior knowledge of the minimal
loss, these parameters may be chosen adaptively to achieve a regret bound of the desired
order. The details are left to the reader.


Proof of Theorem 5.3. In its current form, Lemma 5.1 cannot be applied to the initial
weights $w'_0$ of the variable share distribution because these weights depend on the outcome
sequence $Y_1, \ldots, Y_n$ that, in turn, may depend on the actions of the forecaster. (Recall that in
the model of randomized prediction we allow nonoblivious opponents.) However, because
the draw $I_t$ of the variable share forecaster is conditionally independent (given the past
outcomes $Y_1, \ldots, Y_{t-1}$) of the past random draws $I_1, \ldots, I_{t-1}$, Lemma 4.1 may be used,
which states that, without loss of generality, we may assume that the opponent is oblivious.
In other words, it suffices to prove our result for any fixed (nonrandom) outcome sequence
$y_1, \ldots, y_n$. Once the outcome sequence is fixed, the initial weights are well defined and we
can apply Lemma 5.1.

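Introduce the notation $L(j_1, \dots, j_n) = \ell(j_1, y_1) + \cdots + \ell(j_n, y_n)$. Fix any compound
action $(i_1, \dots, i_n)$. Let $m = \mathrm{size}(i_1, \dots, i_n)$ and $L^* = L(i_1, \dots, i_n)$. If $m = 0$, then

$$\ln W'_n \ge \ln w'_n(i_1, \dots, i_n) = \ln\left(\frac{1}{N}\, e^{-\eta L^*} (1-\alpha)^{L^*}\right)$$

and the theorem follows from Lemma 5.1. Assume then $m \ge 1$. Denote by $F_{L^*+m}$ the set
of compound actions $(j_1, \dots, j_n)$ with cumulative loss $L(j_1, \dots, j_n) \le L^* + m$. Then

$$\ln W'_n = \ln \sum_{(j_1,\dots,j_n) \in \{1,\dots,N\}^n} w'_0(j_1,\dots,j_n)\, e^{-\eta L(j_1,\dots,j_n)}$$
$$\ge \ln \sum_{(j_1,\dots,j_n) \in F_{L^*+m}} w'_0(j_1,\dots,j_n)\, e^{-\eta L(j_1,\dots,j_n)}$$
$$\ge -\eta(L^* + m) + \ln \sum_{(j_1,\dots,j_n) \in F_{L^*+m}} w'_0(j_1,\dots,j_n).$$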
We now show that $F_{L^*+m}$ contains at least a compound action $(j_1, \dots, j_n)$ with a large
weight $w'_0(j_1, \dots, j_n)$. This compound action $(j_1, \dots, j_n)$ mimics $(i_1, \dots, i_n)$ until the
latter makes a switch. If the switch is made right after a step $t$ where $\ell(i_t, y_t) = 0$, then
$(j_1, \dots, j_n)$ delays the switch (switching immediately would imply $w'_0(j_1, \dots, j_n) = 0$) and keeps repeating
action $i_t$ until some later time step $t'$ where $\ell(i_t, y_{t'}) = 1$. Then, from time $t' + 1$ onward,
$(j_1, \dots, j_n)$ mimics $(i_1, \dots, i_n)$ again until the next switch occurs.

We formalize the above argument as follows. Let $t_1$ be the time step where the first switch
$i_{t_1} \ne i_{t_1+1}$ occurs. Set $j_t = i_t$ for all $t = 1, \dots, t_1$. Then set $j_t = i_{t_1}$ for all $t = t_1 + 1, \dots, t'_1$,
where $t'_1$ is the first step after $t_1$ such that $\ell(i_{t_1}, y_{t'_1}) = 1$. If no such $t'_1$ exists, then let $t'_1 = n$.
Proceed by setting $j_t = i_t$ for all $t = t'_1 + 1, t'_1 + 2, \dots$ until a new switch $i_{t_2} \ne i_{t_2+1}$ occurs
for some $t_2 > t'_1$ (if there are no more switches after $t'_1$, then set $j_t = i_t$ for all $t = t'_1 + 1, \dots, n$).
Repeat the procedure described above until the end of the sequence is reached.
Call a sequence of steps $t'_k + 1, \dots, t_{k+1} - 1$ (where $t'_k$ may be 1 and $t_{k+1} - 1$ may be $n$)
an A-block, and a sequence of steps $t_k, \dots, t'_k$ (where $t'_k$ may be $n$) a B-block. Note that
$(j_1, \dots, j_n)$ never makes a switch within an A-block and makes at most one switch
within each B-block. Also, $L(j_1, \dots, j_n) \le L^* + m$, because $j_t = i_t$ within A-blocks,
$L(j_{t_k}, \dots, j_{t'_k}) \le 1$ within each B-block, and the number of B-blocks is at most $m$.
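Introduce the notation

$$Q_t = \frac{1 - (1-\alpha)^{\ell(j_t, y_t)}}{N-1}\,\mathbb{I}_{\{j_t \ne j_{t+1}\}} + (1-\alpha)^{\ell(j_t, y_t)}\,\mathbb{I}_{\{j_t = j_{t+1}\}}.$$

By definition of $w'_0$ we have

$$w'_0(j_1, \dots, j_n) = w'_0(j_1) \prod_{t=1}^{n-1} Q_t.$$

For all $t$ in an A-block, $Q_t = (1-\alpha)^{\ell(j_t, y_t)}$. Now fix any B-block $t_k, \dots, t'_k$. Then $\ell(j_t, y_t) = 0$
for all $t = t_k, \dots, t'_k - 1$ and $\ell(j_{t'_k}, y_{t'_k}) \le 1$, as $\ell(j_{t'_k}, y_{t'_k})$ might be 0 when $t'_k = n$. We thus
have

$$\prod_{t \in \text{B-block}} Q_t \ge (1-\alpha)^0 \times \cdots \times (1-\alpha)^0 \times \frac{1 - (1-\alpha)^1}{N-1} = \frac{\alpha}{N-1}.$$

The factor $(1 - (1-\alpha)^1)/(N-1)$ appears under the assumption that $t'_k < n$ and $i_{t'_k+1} \ne j_{t'_k}$.
If $i_{t'_k+1} = i_{t_k}$, implying $j_{t'_k+1} = j_{t'_k}$, then $Q_{t'_k} = (1-\alpha)^1$ and the above inequality still
holds since $\alpha/(N-1) \le 1 - \alpha$ is implied by the assumption $\alpha \le (N-1)/N$.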
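Now, as explained earlier,

$$\sum_{t} \ell(j_t, y_t) \le L^* \quad\text{and}\quad \sum_{t'} \ell(j_{t'}, y_{t'}) \le m,$$

where the first sum is over all $t$ in A-blocks and the second sum is over all $t'$ in B-blocks
(recall that there are at most $m$ B-blocks). Thus, there exists a $(j_1, \dots, j_n) \in F_{L^*+m}$ with

$$w'_0(j_1, \dots, j_n) = w'_0(j_1) \prod_{t=1}^{n-1} Q_t \ge \frac{1}{N}(1-\alpha)^{L^*}\left(\frac{\alpha}{N-1}\right)^m.$$

Hence,

$$\ln \sum_{(j_1,\dots,j_n) \in F_{L^*+m}} w'_0(j_1, \dots, j_n) \ge \ln\left(\frac{1}{N}(1-\alpha)^{L^*}\left(\frac{\alpha}{N-1}\right)^m\right),$$

which concludes the proof. □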


5.3 Tree Experts 


An important example of structured classes of experts is obtained by representing experts 
(i.e., actions) by binary trees. We call such structured actions tree experts. A tree expert E 
is a finite ordered binary tree in which each node has either 2 child nodes (a left child and 
a right child) or no child nodes (if it is a leaf). The leaves of E are labeled with actions 
chosen from $\{1, \dots, N\}$. We use $\lambda$ to indicate the root of $E$ and 0 and 1 to indicate the
left and right children of $\lambda$. In general, $(x_1, \dots, x_d) \in \{0, 1\}^d$ denotes the left (if $x_d = 0$) or
right (if $x_d = 1$) child of the node $(x_1, \dots, x_{d-1})$.

Like the frameworks studied in Chapters 9, 11, and 12, we assume that, at each
prediction round, a piece of “side information” is made available to the forecaster. We do
not make any assumptions on the source of this side information. In the case of tree
experts the side information is represented, at each time step $t$, by an infinite binary
vector $x_t = (x_{1,t}, x_{2,t}, \dots)$. In the example described in this section, however, we only
make use of a finite number of components in the side information vector.

Figure 5.1. A binary tree (a) and a tree expert based on it (b). Given any side information vector
with prefix (1, 0, ...), this tree expert chooses action 4, the label of leaf (1, 0).

An expert $E$ uses the side information to select a leaf in its tree and outputs the action
labeling that leaf. Given side information $x = (x_1, x_2, \dots)$, let $v = (v_1, \dots, v_d)$ be the
unique leaf of $E$ such that $v_1 = x_1, \dots, v_d = x_d$. The label of this leaf, denoted by $i_E(x) \in \{1, \dots, N\}$,
is the action chosen by expert $E$ upon observation of $x$ (see Figure 5.1).


Example 5.1 (Context trees). A simple application where unbounded side information
occurs is the following: consider the set of actions $\{0, 1\}$, let $\mathcal{Y} = \{0, 1\}$, and define $\ell$ by
$\ell(i, Y) = \mathbb{I}_{\{i \ne Y\}}$. This setup models a randomized binary prediction problem in which the
forecaster is scored with the number of prediction mistakes. Define the side information
at time $t$ by $x_t = (Y_{t-1}, Y_{t-2}, \dots)$, where $Y_t$ for $t \le 0$ is defined arbitrarily, say, as $Y_t = 0$.
Hence, the leaf of a tree expert $E$ determining the expert’s prediction $i_E(x_t)$ is selected
according to a suffix of the outcome sequence. However, depending on $E$, the length of the
suffix used to determine $i_E(x_t)$ may vary. This model of interaction between the outcome
sequence and the expert predictions originated in the area of information theory. Suffix-based
tree experts are also known as prediction suffix trees, context trees, or variable-length
Markov models.
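As a concrete illustration of Example 5.1, the following minimal sketch evaluates one suffix-based tree expert on a short binary sequence. The dictionary representation of the leaves, the helper names `side_information`, `predict`, and `count_mistakes`, and the particular tree and sequence are assumptions made here for illustration; none of them appear in the text.

```python
def side_information(history, depth):
    """First `depth` bits of x_t = (Y_{t-1}, Y_{t-2}, ...).

    `history` lists the outcomes observed so far, oldest first; outcomes
    before the start of the sequence are taken to be 0, as in Example 5.1.
    """
    recent = list(reversed(history))[:depth]
    return tuple(recent + [0] * (depth - len(recent)))


def predict(leaves, x):
    """Return the action labeling the unique leaf that is a prefix of x.

    `leaves` maps each leaf (a tuple of bits) to its action label.
    """
    for d in range(len(x) + 1):
        if x[:d] in leaves:
            return leaves[x[:d]]
    raise ValueError("side information too short to reach a leaf")


def count_mistakes(leaves, outcomes, depth):
    """Cumulative 0-1 loss of the tree expert on the outcome sequence."""
    mistakes = 0
    for t, y in enumerate(outcomes):
        x = side_information(outcomes[:t], depth)
        mistakes += int(predict(leaves, x) != y)
    return mistakes


# Hypothetical tree expert of depth 2 with leaves (0), (1, 0), (1, 1).
example_leaves = {(0,): 0, (1, 0): 1, (1, 1): 0}
example_outcomes = [0, 1, 1, 0, 1]
print(count_mistakes(example_leaves, example_outcomes, depth=2))
```

The leaf (0) uses a suffix of length one while the leaves (1, 0) and (1, 1) use a suffix of length two, which is the variable-length behavior the example describes.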


We define the (expected) regret of a randomized forecaster against the tree expert $E$ by

$$R_{E,n} = \sum_{t=1}^n \bar\ell(p_t, Y_t) - \sum_{t=1}^n \ell(i_E(x_t), Y_t).$$


In this section we derive bounds for the maximal regret against any tree expert such that the 
depth of the corresponding binary tree (i.e., the depth of the node of maximum distance from 
the root) is bounded. As in Section 5.2, this is achieved by a version of the exponentially 
weighted average forecaster. Furthermore, we will see that the forecaster is easy to compute. 

Before moving on to the analysis of regret for tree experts, we state and prove a general
result about sums of functions associated with the leaves of a binary tree. We use $T$ to denote
a finite binary tree. Thus, a tree expert $E$ is a binary tree $T$ with an action labeling each
leaf. The size $\|T\|$ of a finite binary tree $T$ is the number of nodes in $T$. In the following,
we often consider binary subtrees rooted at arbitrary nodes $v$ (see Figure 5.2).

Figure 5.2. A fragment of a binary tree. The dashed line shows a subtree $T_v$ rooted at $v$.


Lemma 5.2. Let $g : \{0, 1\}^* \to \mathbb{R}$ be any nonnegative function defined over the set of all
binary sequences of finite length, and introduce the function $G : \{0, 1\}^* \to \mathbb{R}$ by

$$G(v) = \sum_{T_v} 2^{-\|T_v\|} \prod_{x \in \mathrm{leaves}(T_v)} g(x),$$

where the sum is over all finite binary subtrees $T_v$ rooted at $v = (v_1, \dots, v_d)$. Then

$$G(v) = \frac{g(v)}{2} + \frac{1}{2}\, G(v_1, \dots, v_d, 0)\, G(v_1, \dots, v_d, 1).$$


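Proof. Fix any node $v = (v_1, \dots, v_d)$ and let $T_0$ and $T_1$ range over all the finite binary
subtrees rooted, respectively, at $(v_1, \dots, v_d, 0)$ and $(v_1, \dots, v_d, 1)$. The leaves of any subtree
rooted at $v$ can be split in three subsets: those belonging to the left subtree $T_0$ of $v$, those
belonging to the right subtree $T_1$ of $v$, and the singleton set $\{v\}$ belonging to the subtree
containing only the leaf $v$. Thus we have

$$G(v) = \frac{g(v)}{2} + \sum_{T_0} \sum_{T_1} 2^{-(1 + \|T_0\| + \|T_1\|)} \prod_{x \in \mathrm{leaves}(T_0)} g(x) \prod_{x' \in \mathrm{leaves}(T_1)} g(x')$$
$$= \frac{g(v)}{2} + \frac{1}{2} \left(\sum_{T_0} 2^{-\|T_0\|} \prod_{x \in \mathrm{leaves}(T_0)} g(x)\right) \left(\sum_{T_1} 2^{-\|T_1\|} \prod_{x' \in \mathrm{leaves}(T_1)} g(x')\right)$$
$$= \frac{g(v)}{2} + \frac{1}{2}\, G(v_1, \dots, v_d, 0)\, G(v_1, \dots, v_d, 1). \qquad \Box$$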
Our next goal is to define the randomized exponentially weighted forecaster for the set
of tree experts. To deal with finitely many tree experts, we assume a fixed bound $D \ge 0$
on the maximum depth of a tree expert (see Exercise 5.8 for a more general analysis).
Hence, the first $D$ bits of the side information $x = (x_1, x_2, \dots)$ are sufficient to determine
the predictions of all such tree experts. We use $\mathrm{depth}(v)$ to denote the depth $d$ of a node
$v = (v_1, \dots, v_d)$ in a binary tree $T$, and $\mathrm{depth}(T)$ to denote the maximum depth of a leaf
in $T$. (We define $\mathrm{depth}(\lambda) = 0$.) Because the number of tree experts based on binary trees
of depth bounded by $D$ is $N^{2^D}$ (see Exercise 5.10), computing a weight separately for
each tree expert is computationally infeasible even for moderate values of $D$. To obtain an
easily calculable approximation, we use, just as in Section 5.2, the exponentially weighted
average forecaster with nonuniform initial weights. In view of using Lemma 5.1, we just
have to assign initial weights to all tree experts so that the sum of these weights is at most
1. This can be done as follows.

For $D \ge 0$ fixed, define the $D$-size of a binary tree $T$ of depth at most $D$ as the number
of nodes in $T$ minus the number of leaves at depth $D$:

$$\|T\|_D = \|T\| - |\{v \in \mathrm{leaves}(T) : \mathrm{depth}(v) = D\}|.$$

Lemma 5.3. For any $D \ge 0$,

$$\sum_{T \,:\, \mathrm{depth}(T) \le D} 2^{-\|T\|_D} = 1,$$

where the summation is over all trees rooted at $\lambda$, of depth at most $D$.


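Proof. We proceed by induction on $D$. For $D = 0$ the lemma holds because $\|T\|_0 = 0$
for the single-node tree $\{\lambda\}$ of depth 0. For the induction step, fix $D \ge 1$ and define

$$g(v) = \begin{cases} 0 & \text{if } \mathrm{depth}(v) > D \\ 2 & \text{if } \mathrm{depth}(v) = D \\ 1 & \text{otherwise.} \end{cases}$$

Then, for any tree $T$,

$$2^{-\|T\|} \prod_{x \in \mathrm{leaves}(T)} g(x) = \begin{cases} 0 & \text{if } \mathrm{depth}(T) > D \\ 2^{-\|T\|_D} & \text{if } \mathrm{depth}(T) \le D. \end{cases}$$

Applying Lemma 5.2 with $v = \lambda$ we get

$$\sum_{T \,:\, \mathrm{depth}(T) \le D} 2^{-\|T\|_D} = \sum_{T} 2^{-\|T\|} \prod_{x \in \mathrm{leaves}(T)} g(x) = \frac{g(\lambda)}{2} + \frac{1}{2}\, G(0)\, G(1).$$

Since $\lambda$ has depth 0 and $D \ge 1$, $g(\lambda) = 1$. Furthermore, letting $T_0$ and $T_1$ range over the
trees rooted at node 0 (the left child of $\lambda$) and node 1 (the right child of $\lambda$), for each $b \in \{0, 1\}$
we have

$$G(b) = \sum_{T_b \,:\, \mathrm{depth}(T_b) \le D-1} 2^{-\|T_b\|} \prod_{x \in \mathrm{leaves}(T_b)} g(x) = \sum_{T \,:\, \mathrm{depth}(T) \le D-1} 2^{-\|T\|_{D-1}} = 1$$

by induction hypothesis. □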
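Lemma 5.3 can be checked numerically for small $D$ by brute-force enumeration. The sketch below is only a sanity check under an assumed encoding (a tree is either `None`, meaning a single leaf, or a pair of subtrees); the function names are hypothetical and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def trees(max_depth):
    """Enumerate all finite ordered binary trees of depth at most max_depth.

    A tree is None (a single leaf) or a pair (left, right) of subtrees.
    """
    yield None
    if max_depth > 0:
        for left in trees(max_depth - 1):
            for right in trees(max_depth - 1):
                yield (left, right)


def d_size(tree, D, depth=0):
    """Number of nodes of the tree minus the number of leaves at depth D."""
    if tree is None:
        return 0 if depth == D else 1
    left, right = tree
    return 1 + d_size(left, D, depth + 1) + d_size(right, D, depth + 1)


def total_weight(D):
    """Sum of 2^{-D-size} over all trees of depth at most D."""
    return sum(Fraction(1, 2 ** d_size(tree, D)) for tree in trees(D))


for D in range(4):
    print(D, total_weight(D))
```

The number of trees grows doubly exponentially with $D$, so this enumeration is practical only for very small depths, which is the motivation for the recursive argument of Lemma 5.2.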
Using Lemma 5.3, we can define an initial weight over all tree experts $E$ of depth at
most $D$ as follows:

$$w_{E,0} = 2^{-\|E\|_D}\, N^{-|\mathrm{leaves}(E)|},$$

where, in order to simplify notation, we write $\|E\|_D$ and $\mathrm{leaves}(E)$ for $\|T\|_D$ and $\mathrm{leaves}(T)$
whenever the tree expert $E$ has $T$ as the underlying binary tree. (In general, when we speak
about a leaf, node, or depth of a tree expert, we always mean a leaf, node, or depth of the
underlying tree $T$.) The factor $N^{-|\mathrm{leaves}(E)|}$ in the definition of $w_{E,0}$ accounts for the fact
that there are $N^{|\mathrm{leaves}(T)|}$ tree experts $E$ for each binary tree $T$.

If $u = (u_1, \dots, u_d)$ is a prefix of $v = (u_1, \dots, u_d, \dots, u_{d'})$, we write $u \sqsubseteq v$. In particular,
$u \sqsubset v$ if $u \sqsubseteq v$ and $v$ has at least one extra component $v_{d+1}$ with respect to $u$.

Define the weight $w_{E,t-1}$ of a tree expert $E$ at time $t$ by

$$w_{E,t-1} = w_{E,0} \prod_{v \in \mathrm{leaves}(E)} w_{E,v,t-1},$$

where $w_{E,v,t-1}$, the weight of leaf $v$ in $E$, is defined as follows: $w_{E,v,0} = 1$ and

$$w_{E,v,t} = \begin{cases} w_{E,v,t-1}\, e^{-\eta \ell(i_E(v), Y_t)} & \text{if } v \sqsubseteq x_t \\ w_{E,v,t-1} & \text{otherwise,} \end{cases}$$

where $i_E(v) = i_E(x_t)$ is the action labeling the leaf $v \sqsubseteq x_t$ of $E$. (Note that $v$ is unique.) As
no other leaf $v' \ne v$ of $E$ is updated at time $t$, we have $w_{E,v',t} = w_{E,v',t-1}$ for all such $v'$,
and therefore

$$w_{E,t} = w_{E,t-1}\, e^{-\eta \ell(i_E(x_t), Y_t)} \quad \text{for all } t = 1, 2, \dots$$

as $v \sqsubseteq x_t$ always holds for exactly one leaf of $E$. At time $t$, the randomized exponentially
weighted forecaster draws action $k$ with probability

$$p_{k,t} = \frac{\sum_{E} \mathbb{I}_{\{i_E(x_t) = k\}}\, w_{E,t-1}}{\sum_{E'} w_{E',t-1}},$$

where the sums are over tree experts $E$ with $\mathrm{depth}(E) \le D$. Using Lemma 5.1, we
immediately get the following regret bound.


Theorem 5.4. For all $n \ge 1$, the regret of the randomized exponentially weighted forecaster
run over the set of tree experts of depth at most $D$ satisfies

$$R_{E,n} \le \frac{\|E\|_D}{\eta} \ln 2 + \frac{|\mathrm{leaves}(E)|}{\eta} \ln N + \frac{\eta}{8}\, n$$

for all such tree experts $E$ and for all sequences $x_1, \dots, x_n$ of side information.

Observe that if the depth of the tree is at most $D$, then $\|E\|_D \le 2^D - 1$ and $|\mathrm{leaves}(E)| \le 2^D$,
leading to the regret bound

$$\max_{E \,:\, \mathrm{depth}(E) \le D} R_{E,n} \le \frac{2^D}{\eta} \ln(2N) + \frac{\eta}{8}\, n.$$

Thus, choosing $\eta$ to minimize the upper bound yields a forecaster with

$$\max_{E \,:\, \mathrm{depth}(E) \le D} R_{E,n} \le \sqrt{n\, 2^{D-1} \ln(2N)}.$$
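The tuning step can be made explicit: a bound of the form $a/\eta + \eta n/8$ with $a = 2^D \ln(2N)$ is minimized at $\eta = \sqrt{8a/n}$, where it equals $\sqrt{an/2}$. The short sketch below checks this arithmetic numerically; the function names and the sample values of $n$, $D$, and $N$ are assumptions chosen for illustration.

```python
import math


def tree_expert_bound(n, D, N, eta):
    """Uniform regret bound (2^D / eta) * ln(2N) + eta * n / 8."""
    return (2 ** D / eta) * math.log(2 * N) + eta * n / 8


def tuned_eta(n, D, N):
    """Value of eta that minimizes tree_expert_bound over eta > 0."""
    return math.sqrt(8 * (2 ** D) * math.log(2 * N) / n)


n, D, N = 1000, 3, 4
eta = tuned_eta(n, D, N)
print(eta, tree_expert_bound(n, D, N, eta))
print(math.sqrt(n * 2 ** (D - 1) * math.log(2 * N)))  # closed form of the tuned bound
```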




We now show how to implement this forecaster using $N$ weights for each node of the
complete binary tree of depth $D$; thus $N(2^{D+1} - 1)$ weights in total. This is a substantial
improvement with respect to using a weight for each tree expert because there are $N^{2^D}$
tree experts with corresponding tree of depth at most $D$ (see Exercise 5.10). The efficient
forecaster, which we call the tree expert forecaster, is described in the following. Because
we only consider tree experts of depth $D$ at most, we may assume without loss of generality
that the side information $x_t$ is a string of $D$ bits, that is, $x_t \in \{0, 1\}^D$. This convention
simplifies the notation that follows.


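THE TREE EXPERT FORECASTER
Parameters: Real number $\eta > 0$, integer $D \ge 0$.

Initialization: $w_{i,v,0} = 1$, $\bar w_{i,v,0} = 1$ for each $i = 1, \dots, N$ and for each node
$v = (v_1, \dots, v_d)$ with $d \le D$.

For each round $t = 1, 2, \dots$

(1) draw an action $I_t$ from $\{1, \dots, N\}$ according to the distribution

$$p_{i,t} = \frac{\bar w_{i,\lambda,t-1}}{\sum_{j=1}^N \bar w_{j,\lambda,t-1}}, \qquad i = 1, \dots, N;$$

(2) obtain $Y_t$ and compute, for each $v$ and for each $i = 1, \dots, N$,

$$w_{i,v,t} = \begin{cases} w_{i,v,t-1}\, e^{-\eta \ell(i, Y_t)} & \text{if } v \sqsubseteq x_t \\ w_{i,v,t-1} & \text{otherwise;} \end{cases}$$

(3) recursively update each node $v = (v_1, \dots, v_d)$ with $d = D, D-1, \dots, 0$

$$\bar w_{i,v,t} = \begin{cases} \frac{1}{2N}\, w_{i,v,t} & \text{if } v = x_t \\ \frac{1}{2N} \sum_{j=1}^N w_{j,v,t} & \text{if } \mathrm{depth}(v) = D \text{ and } v \ne x_t \\ \frac{1}{2N}\, w_{i,v,t} + \frac{1}{2}\, \bar w_{i,v_0,t}\, \bar w_{i,v_1,t} & \text{if } v \sqsubset x_t \\ \bar w_{i,v,t-1} & \text{if } \mathrm{depth}(v) < D \text{ and } v \not\sqsubseteq x_t \end{cases}$$

where $v_0 = (v_1, \dots, v_d, 0)$ and $v_1 = (v_1, \dots, v_d, 1)$.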


With each weight $w_{i,v,t}$ the tree expert forecaster associates an auxiliary weight $\bar w_{i,v,t}$. As
shown in the next theorem, the auxiliary weight $\bar w_{i,\lambda,t-1}$ equals

$$\sum_{E} \mathbb{I}_{\{i_E(x_t) = i\}}\, w_{E,t-1}.$$

The computation of $w_{i,v,t}$ and $\bar w_{i,v,t}$ given $w_{i,v,t-1}$ and $\bar w_{i,v,t-1}$ for each $v$, $i$ can be carried
out in $O(D)$ time using a simple dynamic programming scheme. Indeed, only the weights
associated with the $D$ nodes $v \sqsubseteq x_t$ change their values from time $t - 1$ to time $t$.


Theorem 5.5. Fix a nonnegative integer $D \ge 0$. For any sequence of outcomes and for
all $t \ge 1$, the conditional distribution of the action $I_t$ drawn at time $t$ by the tree expert
forecaster with input parameter $D$ is the same as the conditional distribution of the action
$I_t$ drawn at time $t$ by the exponentially weighted forecaster run over the set of all tree
experts of depth bounded by $D$ using the initial weights $w_{E,0}$ defined earlier.


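Proof. We show that for each $t = 1, 2, \dots$ and for each $k = 1, \dots, N$,

$$\sum_{E} \mathbb{I}_{\{i_E(x_t) = k\}}\, w_{E,t-1} = \bar w_{k,\lambda,t-1},$$

where the sum is over all tree experts $E$ such that $\mathrm{depth}(E) \le D$. We start by rewriting the
left-hand side of this expression as

$$\sum_{E} \mathbb{I}_{\{i_E(x_t) = k\}}\, w_{E,t-1} = \sum_{E} 2^{-\|E\|_D} N^{-|\mathrm{leaves}(E)|}\, \mathbb{I}_{\{i_E(x_t) = k\}} \prod_{v \in \mathrm{leaves}(E)} w_{E,v,t-1}$$
$$= \sum_{E} 2^{-\|E\|_D}\, \mathbb{I}_{\{i_E(x_t) = k\}} \prod_{v \in \mathrm{leaves}(E)} \frac{w_{E,v,t-1}}{N}$$
$$= \sum_{T} 2^{-\|T\|_D} \sum_{(i_1, \dots, i_d)} \prod_{j=1}^{d} \frac{w_{i_j,v_j,t-1}}{N}\, \mathbb{I}_{\{v_j \sqsubseteq x_t \Rightarrow i_j = k\}},$$

where in the last step we split the sum over tree experts $E$ in a sum over trees $T$ and a
sum over all assignments $(i_1, \dots, i_d) \in \{1, \dots, N\}^d$ of actions to the leaves $v_1, \dots, v_d$ of
$T$. The indicator function selects those assignments $(i_1, \dots, i_d)$ such that the unique leaf
$v_j$ with $v_j \sqsubseteq x_t$ is labeled $k$. Exchanging the sum over assignments with the product over
leaves gives

$$\sum_{E} \mathbb{I}_{\{i_E(x_t) = k\}}\, w_{E,t-1} = \sum_{T} 2^{-\|T\|_D} \prod_{j=1}^{d} \sum_{i=1}^{N} \frac{w_{i,v_j,t-1}}{N}\, \mathbb{I}_{\{v_j \sqsubseteq x_t \Rightarrow i = k\}}.$$

Note that this last expression is of the form

$$\sum_{T} 2^{-\|T\|_D} \prod_{v \in \mathrm{leaves}(T)} g_{t-1}(v) \quad\text{for}\quad g_{t-1}(v) = \sum_{i=1}^{N} \frac{w_{i,v,t-1}}{N}\, \mathbb{I}_{\{v \sqsubseteq x_t \Rightarrow i = k\}}.$$

Let

$$G_{t-1}(v) = \sum_{T_v} 2^{-\|T_v\|_D} \prod_{x \in \mathrm{leaves}(T_v)} g_{t-1}(x),$$

where $T_v$ ranges over all trees rooted at $v$. By Lemma 5.2 (noting that the lemma remains
true if $\|T_v\|$ is replaced by $\|T_v\|_D$), and using the definition of $g_{t-1}$, for all $v = (v_1, \dots, v_d)$,

$$G_{t-1}(v) = \begin{cases} \frac{1}{2N}\, w_{k,v,t-1} & \text{if } v = x_t \\ \frac{1}{2N} \sum_{i=1}^{N} w_{i,v,t-1} & \text{if } \mathrm{depth}(v) = D \text{ and } v \ne x_t \\ \frac{1}{2N}\, w_{k,v,t-1} + \frac{1}{2}\, G_{t-1}(v_0)\, G_{t-1}(v_1) & \text{otherwise} \end{cases}$$

where $v_0 = (v_1, \dots, v_d, 0)$ and $v_1 = (v_1, \dots, v_d, 1)$. We now prove, by induction on $t \ge 1$,
that $G_{t-1}(\lambda) = \bar w_{k,\lambda,t-1}$ for all $t$. For $t = 1$, $\bar w_{k,\lambda,0} = 1$. Also,

$$G_0(\lambda) = \sum_{T} 2^{-\|T\|_D} \prod_{v \in \mathrm{leaves}(T)} \frac{1}{N} \sum_{i=1}^{N} w_{i,v,0} = 1$$

because $w_{i,v,0} = 1$ for all $i$ and $v$, and we used Lemma 5.3 to sum $2^{-\|T\|_D}$. Assuming the
claim holds for $t - 1$, note that the definition of $G_t(v)$ and $\bar w_{i,v,t}$ is the same for the cases
$v = x_t$, $v \ne x_t$, and $v \sqsubset x_t$. For the case $v \not\sqsubseteq x_t$ with $\mathrm{depth}(v) < D$, we have $\bar w_{i,v,t} = \bar w_{i,v,t-1}$.
Furthermore, $w_{i,v,t} = w_{i,v,t-1}$ for all $i$. Thus $G_t(v) = G_{t-1}(v)$ as all further $v'$ involved in
the recursion for $G_t(v)$ also satisfy $v' \not\sqsubseteq x_t$. By induction hypothesis, $G_{t-1}(v) = \bar w_{i,v,t-1}$
and the proof is concluded. □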


5.4 The Shortest Path Problem 


In this section we discuss a representative example of structured expert classes that has 
received attention in the literature for its many applications. Our purpose is to describe 
the main ideas in the simplest form rather than to offer an exhaustive account of online 
prediction problems for which computationally efficient solutions exist. The shortest path 
problem is the ideal guinea pig for our purposes. 

Consider a network represented by a set of nodes connected by edges, and assume that 
we have to send a stream of packets from a source node to a destination node. At each 
time instance a packet is sent along a chosen route connecting source and destination. 
Depending on traffic, each edge in the network may have a different delay, and the total 
delay the packet suffers on the chosen route is the sum of delays of the edges composing 
the route. The delays may vary in each time instance in an arbitrary way, and our goal is to 
find a way of choosing the route in each time instance such that the sum of the total delays 
over time is not much more than that of the best fixed route in the network. 

Of course, this problem may be cast as a sequential prediction problem in which each 
possible route is represented by an expert. However, the number of routes is typically expo- 
nentially large in the number of edges, and therefore computationally efficient predictors 
are called for. In this section we describe two solutions of very different flavor. One of them 
is based on the follow-the-perturbed-leader forecaster discussed in Section 4.3, whereas the 
other is based on an efficient computation of the exponentially weighted average forecaster. 
Both solutions have different advantages and may be generalized in different directions. 
The key for both solutions is the additive structure of the loss, that is, the fact that the delay
corresponding to each route may be computed as the sum of the delays of the edges on the 
route. 

To formalize the problem, consider a (finite) directed acyclic graph with a set of edges
$E = \{e_1, \dots, e_{|E|}\}$ and set of vertices $V$. Thus, each edge $e \in E$ is an ordered pair of vertices
$(v_1, v_2)$. Let $u$ and $v$ be two distinguished vertices in $V$. A path from $u$ to $v$ is a sequence
of edges $e^{(1)}, \dots, e^{(k)}$ such that $e^{(1)} = (u, v_1)$, $e^{(j)} = (v_{j-1}, v_j)$ for all $j = 2, \dots, k - 1$,
and $e^{(k)} = (v_{k-1}, v)$. We identify a path with a binary vector $\mathbf{i} \in \{0, 1\}^{|E|}$ such that the $j$th
component of $\mathbf{i}$ equals 1 if and only if the edge $e_j$ is in the path. For simplicity, we assume
that every edge in $E$ is on some path from $u$ to $v$ and every vertex in $V$ is an endpoint of an
edge (see Figure 5.3 for examples).
Figure 5.3. Two examples of directed acyclic graphs for the shortest path problem.

In each round $t = 1, \dots, n$ of the forecasting game, the forecaster chooses a path $\mathbf{I}_t$
among all paths from $u$ to $v$. Then a loss $\ell_{e,t} \in [0, 1]$ is assigned to each edge $e \in E$.
Formally, we may identify the outcome $Y_t$ with the vector $\boldsymbol\ell_t \in [0, 1]^{|E|}$ of losses whose $j$th
component is $\ell_{e_j,t}$. The loss of a path $\mathbf{i}$ at time $t$ equals the sum of the losses of the edges
on the path, that is,

$$\ell(\mathbf{i}, Y_t) = \mathbf{i} \cdot \boldsymbol\ell_t.$$

Just as before, the forecaster is allowed to randomize and to choose $\mathbf{I}_t$ according to the
distribution $p_t$ over all paths from $u$ to $v$. We study the “expected” regret

$$\sum_{t=1}^n \bar\ell(p_t, Y_t) - \min_{\mathbf{i}} \sum_{t=1}^n \ell(\mathbf{i}, Y_t),$$

where the minimum is taken over all paths $\mathbf{i}$ from $u$ to $v$ and $\bar\ell(p_t, Y_t) = \sum_{\mathbf{i}} p_{\mathbf{i},t}\, \ell(\mathbf{i}, Y_t)$.
Note that the loss $\ell(\mathbf{i}, Y_t)$ is not bounded by 1 but rather by the length $K$ of the longest path
from $u$ to $v$. Thus, by the Hoeffding-Azuma inequality, with probability at least $1 - \delta$, the
difference between the actual cumulative loss $\sum_{t=1}^n \ell(\mathbf{I}_t, Y_t)$ and $\sum_{t=1}^n \bar\ell(p_t, Y_t)$ is bounded
by $K\sqrt{(n/2)\ln(1/\delta)}$.

We describe two different computationally efficient forecasters with similar performance 
guarantees. 


Follow the Perturbed Leader 
The first solution is based on selecting, at each time $t$, the path that minimizes the “perturbed”
cumulative loss up to time $t - 1$. This is the idea of the forecaster analyzed in Section 4.3
that may be adapted to the shortest path problem easily as follows.

Let $\mathbf{Z}_1, \dots, \mathbf{Z}_n$ be independent, identically distributed random vectors taking values in
$\mathbb{R}^{|E|}$. The follow-the-perturbed-leader forecaster chooses the path

$$\mathbf{I}_t = \operatorname*{argmin}_{\mathbf{i}} \; \mathbf{i} \cdot \left(\sum_{s=1}^{t-1} \boldsymbol\ell_s + \mathbf{Z}_t\right).$$

Thus, at time $t$, the cumulative loss of each edge $e_j$ is “perturbed” by the random quantity
$Z_{j,t}$ and $\mathbf{I}_t$ is the path that minimizes the perturbed cumulative loss over all paths. Since
efficient algorithms exist for finding the shortest path in a directed acyclic graph (for acyclic
graphs linear-time algorithms are known), the forecaster may be computed efficiently. The
following theorem bounds the regret of this simple algorithm when the perturbation vectors
$\mathbf{Z}_t$ are uniformly distributed. An improved performance bound may be obtained by using a
two-sided exponential distribution instead (just as in Corollary 4.5). The straightforward
details are left as an exercise (see Exercises 5.11 and 5.12).


Theorem 5.6. Consider the follow-the-perturbed-leader forecaster for the shortest path
problem such that the vectors $\mathbf{Z}_t$ are distributed uniformly in $[0, \Delta]^{|E|}$. Then, with
probability at least $1 - \delta$,

$$\sum_{t=1}^n \ell(\mathbf{I}_t, Y_t) - \min_{\mathbf{i}} \sum_{t=1}^n \ell(\mathbf{i}, Y_t) \le K\Delta + \frac{nK|E|}{\Delta} + K\sqrt{\frac{n}{2}\ln\frac{1}{\delta}},$$

where $K$ is the length of the longest path from $u$ to $v$. With the choice $\Delta = \sqrt{n|E|}$ the upper
bound becomes $2K\sqrt{n|E|} + K\sqrt{(n/2)\ln(1/\delta)}$.
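One round of the follow-the-perturbed-leader forecaster with uniform perturbations can be sketched as follows. The graph encoding is an assumption made for this illustration: vertices are numbered $0, \dots, |V|-1$ in topological order (every edge goes from a lower to a higher index), the source $u$ is vertex 0 and the destination $v$ is the last vertex. The names `shortest_path` and `fpl_path` are hypothetical.

```python
import random


def shortest_path(num_vertices, edges, weights):
    """Minimum-weight path from vertex 0 to the last vertex of a DAG.

    `edges` is a list of pairs (a, b) with a < b; the result is the list
    of indices of the edges on the path, in order. Runs in O(|V| + |E|).
    """
    out = [[] for _ in range(num_vertices)]
    for idx, (a, _) in enumerate(edges):
        out[a].append(idx)
    dist = [float("inf")] * num_vertices
    best_edge = [None] * num_vertices
    dist[-1] = 0.0
    # Process vertices in reverse topological order.
    for a in range(num_vertices - 2, -1, -1):
        for idx in out[a]:
            candidate = weights[idx] + dist[edges[idx][1]]
            if candidate < dist[a]:
                dist[a], best_edge[a] = candidate, idx
    path, a = [], 0
    while a != num_vertices - 1:
        path.append(best_edge[a])
        a = edges[best_edge[a]][1]
    return path


def fpl_path(num_vertices, edges, cumulative_loss, delta, rng):
    """Draw Z uniformly from [0, delta]^{|E|} and return the path that
    minimizes the perturbed cumulative loss."""
    perturbed = [loss + rng.uniform(0.0, delta) for loss in cumulative_loss]
    return shortest_path(num_vertices, edges, perturbed)


# Hypothetical graph with three paths from vertex 0 to vertex 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
cumulative_loss = [1.0, 5.0, 4.0, 1.0, 0.5]
print(fpl_path(4, edges, cumulative_loss, delta=2.0, rng=random.Random(0)))
```

A fresh perturbation is drawn in every round, so only the cumulative edge losses need to be stored between rounds.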


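Proof. The proof mimics the arguments of Theorem 4.2 and Corollary 4.4: by Lemma 4.1,
it suffices to consider an oblivious opponent and to show that in that case the expected regret
satisfies

$$\mathbb{E} \sum_{t=1}^n \ell(\mathbf{I}_t, y_t) - \min_{\mathbf{i}} \sum_{t=1}^n \ell(\mathbf{i}, y_t) \le K\Delta + \frac{nK|E|}{\Delta}.$$

As in the proof of Theorem 4.2, we define the fictitious forecaster

$$\hat{\mathbf{I}}_t = \operatorname*{argmin}_{\mathbf{i}} \; \mathbf{i} \cdot \left(\sum_{s=1}^{t} \boldsymbol\ell_s + \mathbf{Z}_t\right),$$

which differs from $\mathbf{I}_t$ in that the cumulative loss is now calculated up to time $t$ (rather than
$t - 1$). Of course, $\hat{\mathbf{I}}_t$ cannot be calculated by the forecaster because it uses information
not available at time $t$: it is defined merely for the purpose of the proof. Exactly as in
Theorem 4.2, one has, for the expected cumulative loss of the fictitious forecaster,

$$\mathbb{E} \sum_{t=1}^n \ell(\hat{\mathbf{I}}_t, y_t) \le \min_{\mathbf{i}} \sum_{t=1}^n \ell(\mathbf{i}, y_t) + \mathbb{E} \max_{\mathbf{i}}\, (\mathbf{i} \cdot \mathbf{Z}_n) + \mathbb{E} \max_{\mathbf{i}}\, (-\mathbf{i} \cdot \mathbf{Z}_1).$$

Since the components of $\mathbf{Z}_1$ are assumed to be nonnegative, the last term on the right-hand
side may be dropped. On the other hand,

$$\mathbb{E} \max_{\mathbf{i}}\, (\mathbf{i} \cdot \mathbf{Z}_n) \le \max_{\mathbf{i}} \|\mathbf{i}\|_1\, \mathbb{E} \|\mathbf{Z}_n\|_\infty \le K\Delta,$$

where $K = \max_{\mathbf{i}} \|\mathbf{i}\|_1$ is the length of the longest path from $u$ to $v$. It remains to compare
the cumulative loss of $\mathbf{I}_t$ with that of $\hat{\mathbf{I}}_t$. Once again, this may be done just as in Theorem
4.2. Define, for any $\mathbf{z} \in \mathbb{R}^{|E|}$, the optimal path

$$\mathbf{i}^*(\mathbf{z}) = \operatorname*{argmin}_{\mathbf{i}} \; \mathbf{i} \cdot \mathbf{z}$$

and the function

$$F_t(\mathbf{z}) = \mathbf{i}^*\!\left(\sum_{s=1}^{t-1} \boldsymbol\ell_s + \mathbf{z}\right) \cdot \boldsymbol\ell_t.$$

Denoting the density function of the random vector $\mathbf{Z}_t$ by $f(\mathbf{z})$, we clearly have

$$\mathbb{E}\, \ell(\mathbf{I}_t, y_t) = \int_{\mathbb{R}^{|E|}} F_t(\mathbf{z}) f(\mathbf{z})\, d\mathbf{z} \quad\text{and}\quad \mathbb{E}\, \ell(\hat{\mathbf{I}}_t, y_t) = \int_{\mathbb{R}^{|E|}} F_t(\mathbf{z}) f(\mathbf{z} - \boldsymbol\ell_t)\, d\mathbf{z}.$$

Therefore, for each $t = 1, \dots, n$,

$$\mathbb{E}\, \ell(\mathbf{I}_t, y_t) - \mathbb{E}\, \ell(\hat{\mathbf{I}}_t, y_t) = \boldsymbol\ell_t \cdot \int_{\mathbb{R}^{|E|}} \mathbf{i}^*\!\left(\sum_{s=1}^{t-1} \boldsymbol\ell_s + \mathbf{z}\right) \big(f(\mathbf{z}) - f(\mathbf{z} - \boldsymbol\ell_t)\big)\, d\mathbf{z}$$
$$\le \boldsymbol\ell_t \cdot \int_{\{\mathbf{z} \,:\, f(\mathbf{z}) > f(\mathbf{z} - \boldsymbol\ell_t)\}} \mathbf{i}^*\!\left(\sum_{s=1}^{t-1} \boldsymbol\ell_s + \mathbf{z}\right) f(\mathbf{z})\, d\mathbf{z}$$
$$\le \|\boldsymbol\ell_t\|_\infty \left\| \int_{\{\mathbf{z} \,:\, f(\mathbf{z}) > f(\mathbf{z} - \boldsymbol\ell_t)\}} \mathbf{i}^*\!\left(\sum_{s=1}^{t-1} \boldsymbol\ell_s + \mathbf{z}\right) f(\mathbf{z})\, d\mathbf{z} \right\|_1$$
$$\le K \int_{\{\mathbf{z} \,:\, f(\mathbf{z}) > f(\mathbf{z} - \boldsymbol\ell_t)\}} f(\mathbf{z})\, d\mathbf{z}$$

(since $\|\boldsymbol\ell_t\|_\infty \le 1$ and all paths have length at most $K$)

$$\le \frac{K|E|}{\Delta},$$

where the last inequality follows from the argument in Corollary 4.4. □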


Theorem 5.6 asserts that the follow-the-perturbed-leader forecaster used with uniformly
distributed perturbations has a cumulative regret of the order of $K\sqrt{n|E|}$.
By using two-sided exponentially distributed perturbation, an alternative bound of the
order of $\sqrt{L^*|E|K \ln|E|} \le K\sqrt{n|E|\ln|E|}$ may be achieved (Exercise 5.11) or even
$\sqrt{L^*|E| \ln M} \le \sqrt{nK|E| \ln M}$, where $M$ is the number of all paths in the graph leading
from $u$ to $v$ (Exercise 5.12). Clearly, $M \le \binom{|E|}{K}$; but in concrete cases (e.g., in the
two examples of Figure 5.3) $M$ may be significantly smaller than this upper bound. These
bounds can be improved to $K\sqrt{n \ln M}$ by a fundamentally different solution, which we
describe next.


Efficient Computation of the Weighted Average Forecaster 
A conceptually different way of approaching the shortest path problem is to consider each 
path as an action (expert) and look for an efficient algorithm that computes the exponentially 
weighted average predictor over the set of these experts. 

The exponentially weighted average forecaster, calculated over the set of experts given
by all paths from $u$ to $v$, selects, at time $t$, a path $\mathbf{I}_t$ randomly, according to the probability
distribution

$$\mathbb{P}_t[\mathbf{I}_t = \mathbf{i}] = p_{\mathbf{i},t} = \frac{e^{-\eta \sum_{s=1}^{t-1} \mathbf{i} \cdot \boldsymbol\ell_s}}{\sum_{\mathbf{i}'} e^{-\eta \sum_{s=1}^{t-1} \mathbf{i}' \cdot \boldsymbol\ell_s}},$$

where $\eta > 0$ is a parameter of the forecaster and $\mathbb{P}_t$ denotes the conditional probability
given the past actions.

In the remaining part of this section we show that it is possible to draw the random path
$\mathbf{I}_t$ in an efficiently computable way. The main idea is that we select the edges of the path one
by one, according to the appropriate conditional distributions generated by the distribution
over the set of paths given above.

With an abuse of notation, we write $e \in \mathbf{i}$ if the edge $e \in E$ belongs to the path represented
by the binary vector $\mathbf{i}$. Then observe that for any $t = 1, \dots, n$ and path $\mathbf{i}$,

$$\mathbf{i} \cdot \boldsymbol\ell_t = \sum_{e \in \mathbf{i}} \ell_{e,t}$$

and therefore the cumulative loss of each expert $\mathbf{i}$ takes the additive form

$$\sum_{s=1}^{t} \mathbf{i} \cdot \boldsymbol\ell_s = \sum_{e \in \mathbf{i}} L_{e,t},$$

where $L_{e,t} = \sum_{s=1}^{t} \ell_{e,s}$ is the loss accumulated by edge $e$ during the first $t$ rounds of the
game.

For any vertex $w \in V$, let $\mathcal{P}_w$ denote the set of paths from $w$ to $v$. To each vertex $w$ and
$t = 1, \dots, n$, we assign the value

$$G_t(w) = \sum_{\mathbf{i} \in \mathcal{P}_w} e^{-\eta \sum_{e \in \mathbf{i}} L_{e,t}}.$$

First observe that the function $G_t(w)$ may be computed efficiently. To this end, assume that
the vertices $v_1, \dots, v_{|V|}$ of the graph are labeled such that $u = v_1$, $v = v_{|V|}$, and if $i < j$, then
there is no edge from $v_j$ to $v_i$. (Note that such a labeling can be found in time $O(|E|)$.) Then
$G_t(v) = 1$, and once $G_t(v_i)$ has been calculated for all $v_i$ with $i = |V|, |V| - 1, \dots, j + 1$,
then $G_t(v_j)$ is obtained, recursively, by

$$G_t(v_j) = \sum_{w \,:\, (v_j, w) \in E} G_t(w)\, e^{-\eta L_{(v_j,w),t}}.$$

Thus, at time $t$, all values of $G_t(w)$ may be calculated in time $O(|E|)$.
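The backward recursion for $G_t(w)$ can be sketched as follows, assuming the vertices are already numbered $0, \dots, |V|-1$ in topological order with the source first and the destination last. The function name `path_sums`, the list-based graph encoding, and the example graph are assumptions for illustration only.

```python
import math


def path_sums(num_vertices, edges, cum_loss, eta):
    """Return G with G[w] = sum over paths from w to the last vertex of
    exp(-eta * (sum of cumulative edge losses along the path)).

    `edges` is a list of pairs (a, b) with a < b, and cum_loss[idx] is the
    cumulative loss of edge number idx. Each edge is visited once.
    """
    out = [[] for _ in range(num_vertices)]
    for idx, (a, _) in enumerate(edges):
        out[a].append(idx)
    G = [0.0] * num_vertices
    G[-1] = 1.0  # only the empty path leads from the destination to itself
    for a in range(num_vertices - 2, -1, -1):
        G[a] = sum(G[edges[idx][1]] * math.exp(-eta * cum_loss[idx])
                   for idx in out[a])
    return G


# Hypothetical graph: the three paths from 0 to 3 have losses 5.0, 6.0, 2.5.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
cum_loss = [1.0, 5.0, 4.0, 1.0, 0.5]
G = path_sums(4, edges, cum_loss, eta=0.5)
print(G[0], sum(math.exp(-0.5 * L) for L in (5.0, 6.0, 2.5)))
```

The two printed numbers agree up to rounding, although the recursion never enumerates the paths explicitly.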

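It remains to see how the values $G_{t-1}(w)$ may be used to generate a random path $\mathbf{I}_t$
according to the distribution prescribed by the exponentially weighted average forecaster.
We may draw the path $\mathbf{I}_t$ by drawing its edges successively. Denote the $k$th vertex along a
path $\mathbf{i} \in \mathcal{P}_u$ by $v_{\mathbf{i},k}$ for $k = 0, 1, \dots, K_{\mathbf{i}}$, where $v_{\mathbf{i},0} = u$, $v_{\mathbf{i},K_{\mathbf{i}}} = v$, and $K_{\mathbf{i}} = \|\mathbf{i}\|_1$ denotes
the length of the path $\mathbf{i}$. For each path $\mathbf{i}$, we may write

$$p_{\mathbf{i},t} = \mathbb{P}_t[\mathbf{I}_t = \mathbf{i}] = \prod_{k=1}^{K_{\mathbf{i}}} \mathbb{P}_t\big[v_{\mathbf{I}_t,k} = v_{\mathbf{i},k} \,\big|\, v_{\mathbf{I}_t,k-1} = v_{\mathbf{i},k-1}, \dots, v_{\mathbf{I}_t,0} = v_{\mathbf{i},0}\big].$$

To generate a random path $\mathbf{I}_t$ with this distribution, it suffices to generate the vertices $v_{\mathbf{I}_t,k}$
successively for $k = 1, 2, \dots$ such that the product of the conditional probabilities is just
$p_{\mathbf{i},t}$.

Next we show that, for any $k$, the probability that the $k$th vertex in the path $\mathbf{I}_t$ is $v_{\mathbf{i},k}$,
given that the previous vertices in the graph are $v_{\mathbf{i},0}, \dots, v_{\mathbf{i},k-1}$, is

$$\mathbb{P}_t\big[v_{\mathbf{I}_t,k} = v_{\mathbf{i},k} \,\big|\, v_{\mathbf{I}_t,k-1} = v_{\mathbf{i},k-1}, \dots, v_{\mathbf{I}_t,0} = v_{\mathbf{i},0}\big] = \begin{cases} \dfrac{G_{t-1}(v_{\mathbf{i},k})\, e^{-\eta L_{(v_{\mathbf{i},k-1}, v_{\mathbf{i},k}),t-1}}}{G_{t-1}(v_{\mathbf{i},k-1})} & \text{if } (v_{\mathbf{i},k-1}, v_{\mathbf{i},k}) \in E \\ 0 & \text{otherwise.} \end{cases}$$

To see this, just observe that if $(v_{\mathbf{i},k-1}, v_{\mathbf{i},k}) \in E$, then the paths from $v_{\mathbf{i},k-1}$ to $v$ whose first
edge is $(v_{\mathbf{i},k-1}, v_{\mathbf{i},k})$ consist of that edge followed by a path in $\mathcal{P}_{v_{\mathbf{i},k}}$, so that

$$\mathbb{P}_t\big[v_{\mathbf{I}_t,k} = v_{\mathbf{i},k} \,\big|\, v_{\mathbf{I}_t,k-1} = v_{\mathbf{i},k-1}, \dots, v_{\mathbf{I}_t,0} = v_{\mathbf{i},0}\big]$$
$$= \frac{e^{-\eta L_{(v_{\mathbf{i},k-1}, v_{\mathbf{i},k}),t-1}} \sum_{\mathbf{j} \in \mathcal{P}_{v_{\mathbf{i},k}}} e^{-\eta \sum_{e \in \mathbf{j}} L_{e,t-1}}}{\sum_{\mathbf{j} \in \mathcal{P}_{v_{\mathbf{i},k-1}}} e^{-\eta \sum_{e \in \mathbf{j}} L_{e,t-1}}} = \frac{G_{t-1}(v_{\mathbf{i},k})\, e^{-\eta L_{(v_{\mathbf{i},k-1}, v_{\mathbf{i},k}),t-1}}}{G_{t-1}(v_{\mathbf{i},k-1})}$$

as desired. Summarizing, we have the following.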


Theorem 5.7. The algorithm described above computes the exponentially weighted average
forecaster over all paths between vertices $u$ and $v$ in a directed acyclic graph such that, at
each time instance, the algorithm requires $O(|E|)$ operations. The regret is bounded, with
probability at least $1 - \delta$, by

$$\sum_{t=1}^n \ell(\mathbf{I}_t, Y_t) - \min_{\mathbf{i}} \sum_{t=1}^n \ell(\mathbf{i}, Y_t) \le K\left(\frac{\ln M}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2}\ln\frac{1}{\delta}}\right),$$

where $M$ is the total number of paths from $u$ to $v$ in the graph and $K$ is the length of the
longest path.
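The edge-by-edge sampling behind Theorem 5.7 can be sketched as follows. The sketch assumes a topological numbering of the vertices (source 0, destination last), and it takes the probability of moving from a vertex $a$ to a successor $b$ to be $G(b)\, e^{-\eta L_{(a,b)}} / G(a)$, so that the transition probabilities out of each vertex sum to one and their product along a path equals the exponential weight of that path. All function names and the example graph are hypothetical.

```python
import math
import random


def out_edges(num_vertices, edges):
    """Indices of the edges leaving each vertex."""
    out = [[] for _ in range(num_vertices)]
    for idx, (a, _) in enumerate(edges):
        out[a].append(idx)
    return out


def path_sums(num_vertices, edges, cum_loss, eta):
    """G[w] = sum over paths from w to the last vertex of exp(-eta * loss)."""
    out = out_edges(num_vertices, edges)
    G = [0.0] * num_vertices
    G[-1] = 1.0
    for a in range(num_vertices - 2, -1, -1):
        G[a] = sum(G[edges[i][1]] * math.exp(-eta * cum_loss[i]) for i in out[a])
    return G


def transition_probs(a, G, out, edges, cum_loss, eta):
    """Probability of each edge leaving vertex a, as a dict edge index -> prob."""
    return {i: G[edges[i][1]] * math.exp(-eta * cum_loss[i]) / G[a] for i in out[a]}


def draw_path(num_vertices, edges, cum_loss, eta, rng):
    """Draw a path (list of edge indices) with probability proportional to
    exp(-eta * cumulative loss of the path)."""
    out = out_edges(num_vertices, edges)
    G = path_sums(num_vertices, edges, cum_loss, eta)
    path, a = [], 0
    while a != num_vertices - 1:
        probs = transition_probs(a, G, out, edges, cum_loss, eta)
        idx = rng.choices(list(probs), weights=list(probs.values()))[0]
        path.append(idx)
        a = edges[idx][1]
    return path


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
cum_loss = [1.0, 5.0, 4.0, 1.0, 0.5]
print(draw_path(4, edges, cum_loss, eta=0.5, rng=random.Random(0)))
```

Each draw costs one pass over the edges to compute `G` plus one walk from source to destination, which matches the $O(|E|)$ cost per round stated in the theorem.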
5.5 Tracking the Best of Many Actions


The purpose of this section is to develop efficient algorithms to track the best action in the
case when the class of “base” experts is already very large but has some structure. Thus,
in a sense, we consider a combination of the problem of tracking the best action described
in Section 5.2 with predicting as well as the best in a large class of experts with a certain
structure, such as the examples described in Sections 5.3 and 5.4.

Our approach is based on a reformulation of the fixed share tracking algorithm that
allows one to apply it, in a computationally efficient way, over some classes of large and
structured experts. We will illustrate the method on the problem of “tracking the shortest
path.”

The main step in this direction is an alternative expression of the weights of the fixed
share forecaster.


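Lemma 5.4. Consider the fixed share forecaster of Section 5.2. For any $t = 2, \dots, n$, the
probability $p_{i,t}$ and the corresponding normalization factor $W_{t-1}$ can be obtained as

$$p_{i,t} = \frac{(1-\alpha)^{t-1}}{N W_{t-1}}\, e^{-\eta \sum_{s=1}^{t-1} \ell(i, Y_s)} + \frac{\alpha}{N W_{t-1}} \sum_{t'=2}^{t-1} (1-\alpha)^{t-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t-1} \ell(i, Y_s)} + \frac{\alpha}{N},$$

$$W_{t-1} = \frac{\alpha}{N} \sum_{t'=2}^{t-1} (1-\alpha)^{t-1-t'}\, W_{t'-1}\, Z_{t',t-1} + \frac{(1-\alpha)^{t-2}}{N}\, Z_{1,t-1},$$

where $Z_{t',t-1} = \sum_{i=1}^{N} e^{-\eta \sum_{s=t'}^{t-1} \ell(i, Y_s)}$ is the sum of the (unnormalized) weights assigned to
the experts by the exponentially weighted average forecaster method based on the partial
past outcome sequence $(Y_{t'}, \dots, Y_{t-1})$.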
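Proof. The expressions in the lemma follow directly from the recursive definition of the
weights $w_{i,t-1}$. First we show that, for $t = 1, \dots, n$,

$$v_{i,t} = \frac{\alpha}{N} \sum_{t'=2}^{t} (1-\alpha)^{t-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t} \ell(i, Y_s)} + \frac{(1-\alpha)^{t-1}}{N}\, e^{-\eta \sum_{s=1}^{t} \ell(i, Y_s)} \qquad (5.1)$$

and

$$w_{i,t} = \frac{\alpha}{N} W_t + \frac{\alpha}{N} \sum_{t'=2}^{t} (1-\alpha)^{t+1-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t} \ell(i, Y_s)} + \frac{(1-\alpha)^{t}}{N}\, e^{-\eta \sum_{s=1}^{t} \ell(i, Y_s)}. \qquad (5.2)$$

Clearly, for a given $t$, (5.1) implies (5.2) by the definition of the fixed share forecaster.
Since $w_{i,0} = 1/N$ for every expert $i$, (5.1) and (5.2) hold for $t = 1$ and $t = 2$ (for $t = 1$ the
summations are 0 in both equations). Now assume that they hold for some $t \ge 2$. We show
that then (5.1) holds for $t + 1$. By definition,

$$v_{i,t+1} = w_{i,t}\, e^{-\eta \ell(i, Y_{t+1})}$$
$$= \frac{\alpha}{N} W_t\, e^{-\eta \ell(i, Y_{t+1})} + \frac{\alpha}{N} \sum_{t'=2}^{t} (1-\alpha)^{t+1-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t+1} \ell(i, Y_s)} + \frac{(1-\alpha)^{t}}{N}\, e^{-\eta \sum_{s=1}^{t+1} \ell(i, Y_s)}$$
$$= \frac{\alpha}{N} \sum_{t'=2}^{t+1} (1-\alpha)^{t+1-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t+1} \ell(i, Y_s)} + \frac{(1-\alpha)^{t}}{N}\, e^{-\eta \sum_{s=1}^{t+1} \ell(i, Y_s)}$$

and therefore (5.1) and (5.2) hold for all $t = 1, \dots, n$. Now the expression for $p_{i,t}$ follows
from (5.2) by normalization for $t = 2, \dots, n + 1$. Finally, the recursive formula for $W_{t-1}$
can easily be proved from (5.1). Indeed, recalling that $\sum_i w_{i,t} = \sum_i v_{i,t}$, we have that for
any $t = 2, \dots, n$,

$$W_{t-1} = \sum_{i=1}^{N} \left( \frac{\alpha}{N} \sum_{t'=2}^{t-1} (1-\alpha)^{t-1-t'}\, W_{t'-1}\, e^{-\eta \sum_{s=t'}^{t-1} \ell(i, Y_s)} + \frac{(1-\alpha)^{t-2}}{N}\, e^{-\eta \sum_{s=1}^{t-1} \ell(i, Y_s)} \right)$$
$$= \frac{\alpha}{N} \sum_{t'=2}^{t-1} (1-\alpha)^{t-1-t'}\, W_{t'-1} \sum_{i=1}^{N} e^{-\eta \sum_{s=t'}^{t-1} \ell(i, Y_s)} + \frac{(1-\alpha)^{t-2}}{N} \sum_{i=1}^{N} e^{-\eta \sum_{s=1}^{t-1} \ell(i, Y_s)}$$
$$= \frac{\alpha}{N} \sum_{t'=2}^{t-1} (1-\alpha)^{t-1-t'}\, W_{t'-1}\, Z_{t',t-1} + \frac{(1-\alpha)^{t-2}}{N}\, Z_{1,t-1}. \qquad \Box$$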


Examining the formula for $p_{i,t} = \mathbb{P}_t[I_t = i]$ given by Lemma 5.4, one may realize that $I_t$
may be drawn by the following two-step procedure.

THE ALTERNATIVE FIXED SHARE FORECASTER
Parameters: Real numbers $\eta > 0$ and $0 \le \alpha \le 1$.
Initialization: For $t = 1$, choose $I_1$ uniformly from the set $\{1, \ldots, N\}$.
For each round $t = 2, 3, \ldots$


(1) draw $\tau_t$ randomly according to the distribution

$$\mathbb{P}_t[\tau_t = t'] =
\begin{cases}
\dfrac{(1-\alpha)^{t-1}\, Z_{1,t-1}}{N W_{t-1}} & \text{for } t' = 1 \\[2ex]
\dfrac{\alpha (1-\alpha)^{t-t'}\, W_{t'-1}\, Z_{t',t-1}}{N W_{t-1}} & \text{for } t' = 2, \ldots, t,
\end{cases}$$

where we define $Z_{t,t-1} = N$;

(2) given $\tau_t = t'$, choose $I_t$ randomly according to the probabilities

$$\mathbb{P}_t[I_t = i \mid \tau_t = t'] =
\begin{cases}
\dfrac{e^{-\eta \sum_{s=t'}^{t-1} \ell(i,Y_s)}}{Z_{t',t-1}} & \text{for } t' = 1, \ldots, t-1 \\[2ex]
1/N & \text{for } t' = t.
\end{cases}$$


Indeed, Lemma 5.4 immediately implies that

$$p_{i,t} = \sum_{t'=1}^{t} \mathbb{P}_t[I_t = i \mid \tau_t = t']\; \mathbb{P}_t[\tau_t = t']$$

and therefore the alternative fixed share forecaster provides an equivalent implementation
of the fixed share forecaster.
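The equivalence can also be checked numerically on a small example. The following is a minimal sketch, not an efficient implementation: it assumes a finite loss table `losses[t-1][i]` (a hypothetical input standing in for $\ell(i, Y_t)$), runs the fixed share recursion directly, and compares $p_{i,t}$ with the marginal obtained by mixing the conditional distributions of step (2) with the distribution of $\tau_t$ from step (1). The function names are illustrative only.

```python
import math


def fixed_share(losses, eta, alpha):
    """Run the fixed share recursion on losses[t-1][i] = loss of action i in round t.

    Returns (probs, W) with probs[t-1][i] = p_{i,t} and W[t] = W_t (W[0] = 1).
    """
    n_actions = len(losses[0])
    w = [1.0 / n_actions] * n_actions            # w_{i,0} = 1/N
    probs, W = [], [1.0]
    for round_losses in losses:
        total = sum(w)
        probs.append([wi / total for wi in w])   # p_{i,t} = w_{i,t-1} / W_{t-1}
        v = [wi * math.exp(-eta * l) for wi, l in zip(w, round_losses)]
        W_t = sum(v)
        w = [alpha * W_t / n_actions + (1 - alpha) * vi for vi in v]
        W.append(W_t)
    return probs, W


def two_step_marginal(losses, eta, alpha, W, t):
    """Marginal distribution of I_t (for t >= 2) under the two-step procedure."""
    n_actions = len(losses[0])
    marginal = [0.0] * n_actions
    for tp in range(1, t + 1):                   # tp plays the role of t'
        if tp == t:
            p_tau = alpha                        # follows from the convention Z_{t,t-1} = N
            cond = [1.0 / n_actions] * n_actions
        else:
            # e^{-eta * sum_{s=tp}^{t-1} loss(i, Y_s)} for every action i
            e = [math.exp(-eta * sum(losses[s - 1][i] for s in range(tp, t)))
                 for i in range(n_actions)]
            Z = sum(e)
            if tp == 1:
                coef = (1 - alpha) ** (t - 1)
            else:
                coef = alpha * (1 - alpha) ** (t - tp) * W[tp - 1]
            p_tau = coef * Z / (n_actions * W[t - 1])
            cond = [x / Z for x in e]
        for i in range(n_actions):
            marginal[i] += p_tau * cond[i]
    return marginal
```

For any loss table, `two_step_marginal(losses, eta, alpha, W, t)` should agree with `probs[t - 1]` up to floating-point error. The sketch recomputes every partial sum from scratch, so it illustrates the identity rather than the computational savings.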


Theorem 5.8. The fixed share forecaster and the alternative fixed share forecaster are
equivalent in the sense that the generated forecasters have the same distribution. More
precisely, the sequence $(I_1, \ldots, I_n)$ generated by the alternative fixed share forecaster
satisfies

$$\mathbb{P}_t[I_t = i] = p_{i,t}$$

for all $t$ and $i$, where $p_{i,t}$ is the probability of drawing action $i$ at time $t$ computed by the
fixed share forecaster.


Observe that once the value $\tau_t = t'$ is determined, the conditional probability of $I_t = i$
is equivalent to the weight assigned to expert $i$ by the exponentially weighted average
forecaster computed for the last $t - t'$ outcomes of the sequence, that is, for $(Y_{t'}, \ldots, Y_{t-1})$.
Therefore, for $t \ge 2$, the randomized prediction $I_t$ of the fixed share forecaster can be
determined in two steps. First we choose a random time $\tau_t$, which specifies how many of
the most recent outcomes we use for the prediction. Then we choose $I_t$ according to the
exponentially weighted average forecaster based only on these outcomes.

It is not immediately obvious why the alternative implementation of the fixed share
forecaster is more efficient. However, in many cases the probabilities $\mathbb{P}_t[I_t = i \mid \tau_t = t']$
and normalization factors $Z_{t',t-1}$ may be computed efficiently, and in all those cases,
since $W_{t-1}$ can be obtained via the recursion formula of Lemma 5.4, the alternative fixed
share forecaster becomes feasible. Theorem 5.8 offers a general tool for obtaining such
algorithms.

Rather than isolating a single theorem that summarizes conditions under which such an
efficient computation is possible, we illustrate the use of this algorithm on the problem of 
“tracking the shortest path,” that is, when the base experts are defined by paths in a directed 
acyclic graph between two fixed vertices. Thus, we are interested in an efficient forecaster 
that is able to choose a path at each time instant such that the cumulative loss is not much 
larger than the best forecaster that is allowed to switch paths m times. In other words, the 
class of (base) experts is the one defined in Section 5.4, and the goal is to track the best such 
expert. Because the number of paths is typically exponentially large in the number of edges 
of the graph, a direct implementation of the fixed share forecaster is infeasible. However, the 
alternative fixed share forecaster may be combined with the efficient implementation of the 
exponentially weighted average forecaster for the shortest path problem (see Section 5.4) 
to obtain a computationally efficient way of tracking the shortest path. In particular, we 
obtain the following result. Its straightforward, though somewhat technical, proof is left as 
an exercise (see Exercise 5.13). 


Theorem 5.9. Consider the problem of tracking the shortest path between two fixed vertices
in a directed acyclic graph described above. The alternative fixed share algorithm can be
implemented such that, at time $t$, computing the prediction $\mathbf{I}_t$ requires time $O(tK|E| + t^2)$.
The expected tracking regret of the algorithm satisfies

$$\mathbb{E}\, R(\mathbf{i}_1, \ldots, \mathbf{i}_n) \le K \sqrt{\frac{n}{2} \left( (m+1) \ln M + m \ln \frac{e(n-1)}{m} \right)}$$

for all sequences of paths $\mathbf{i}_1, \ldots, \mathbf{i}_n$ such that $\mathrm{size}(\mathbf{i}_1, \ldots, \mathbf{i}_n) \le m$, where $M$ is the number
of all paths in the graph between vertices $u$ and $v$ and $K$ is the length of the longest path.


Note that we extended, in the obvious way, the definition of $\mathrm{size}(\cdot)$ to sequences $(\mathbf{i}_1, \ldots, \mathbf{i}_n)$.


5.6 Bibliographic Remarks 


The notion of tracking regret, the fixed share forecaster, and the variable share forecaster 
were introduced by Herbster and Warmuth [159]. The tracking regret bounds stated in The- 
orem 5.2, Corollary 5.1, and Theorem 5.3 were originally proven in [159]. Vovk [299] has 
shown that the fixed and variable share forecasters correspond to efficient implementations 
of the exponentially weighted average forecaster run over the set of compound actions with 
a specific choice of the initial weights. Our proofs follow Vovk’s analysis. The work [299]
also provides an elegant solution to the problem of tuning the parameter $\alpha$ optimally (see
Exercise 5.6).

Minimization of tracking regret is similar to the sequential allocation problems studied 
within the competitive analysis model (see Borodin and El-Yaniv [36]). Indeed, Blum and 
Burch [32] use tracking regret and the exponentially weighted average forecaster to solve 
a certain class of sequential allocation problems. 

Bousquet and Warmuth [40] consider the tracking regret measured against all compound 
actions with at most m switches and including at most k distinct actions (out of the N 
available actions). They prove that a variant of the fixed share forecaster achieves, with 
high probability, the bound of Theorem 5.2 in which the factor $m \ln N$ is replaced by
$k \ln N + m \ln k$ (see Exercise 5.1). We also refer to Auer and Warmuth [15] and Herbster
and Warmuth [160] for various extensions and powerful variants of the problem.

Tree-based experts are natural evolutions of tree sources, a probabilistic model inten- 
sively studied in information theory. Lemma 5.2 is due to Willems, Shtarkov, and 
Tjalkens [311], who were the first to show how to efficiently compute averages over 
all trees of bounded depth. Theorem 5.4 was proven by Helmbold and Schapire [156]. 
An alternative efficient forecaster for tree experts, also based on dynamic programming, is 
proposed and analyzed by Takimoto, Maruoka, and Vovk [283]. Freund, Schapire, Singer,
and Warmuth [115] propose a more general expert framework in which experts may occa- 
sionally abstain from predicting (see also Section 4.8). Their algorithm can be efficiently 
applied to tree experts, even though the resulting bound apparently has a quadratic (rather 
than linear) dependence on the number of leaves of the tree. Other closely related references 
include Pereira and Singer [233] and Takimoto and Warmuth [285], which consider planar 
decision graphs. The dynamic programming implementations of the forecasters for tree 
experts and for the shortest path problem, both involving computations of sums of products 
over the nodes of a directed acyclic graph, are special cases of the general sum-product
algorithm of Kschischang, Frey, and Loeliger [187]. 

Kalai and Vempala [174] were the first to promote the use of follow-the-perturbed- 
leader type forecasters for efficiently computable prediction, described in Section 5.4. 
Their framework is more general, and the shortest path problem discussed here is just an 
example of a family of online optimization problems. The required property is that the loss 
has an additive form. Awerbuch and Kleinberg [19] and McMahan and Blum [211] extend 
the follow-the-perturbed-leader forecaster to the bandit setting. For a general framework 
of linear-time algorithms to find the shortest path in a directed acyclic graph, we refer 
to [219]. Takimoto and Warmuth [286] consider a family of algorithms based on the efficient 
computation of weighted average predictors for the shortest path problem. György, Linder, 
and Lugosi [137, 138] apply similar techniques to the ones described for the shortest path 
problem in online lossy data compression, where the experts correspond to scalar quantizers. 
The material in Section 5.5 is based on György, Linder, and Lugosi [139]. 

A further example of efficient forecasters for exponentially many experts is provided by
Maass and Warmuth [207]. Their technique is used to learn the class of indicator functions
of axis-parallel boxes when all “instances” $x_t$ (i.e., side information elements) belong to the
$d$-dimensional grid $\{1, \ldots, N\}^d$. Note that the complement of any such box is represented as
the union of $2d$ axis-parallel halfspaces. This union can be learned by the Winnow algorithm
(see Section 12.2) by mapping each original instance $x_t$ to a transformed boolean instance
whose components are the indicator functions of all $2Nd$ axis-parallel halfspaces in the grid.
Maass and Warmuth show that Winnow can be run over the transformed instances in time
polynomial in $d$ and $\log N$, thus achieving an exponential speedup in $N$. In addition, they
show that the resulting mistake bound is essentially the best possible, even if computational
issues are disregarded.


5.7 Exercises


5.1 (Tracking a subset of actions) Consider the problem of tracking a small unknown subset of $k$
actions chosen from the set of all $N$ actions. That is, we want to bound the tracking regret

$$\sum_{t=1}^{n} \bar{\ell}(\mathbf{p}_t, Y_t) - \sum_{t=1}^{n} \ell(i_t, Y_t),$$

where $\mathrm{size}(i_1, \ldots, i_n) \le m$ and the compound action $(i_1, \ldots, i_n)$ contains at most $k$ distinct
actions, $k$ being typically much smaller than $m$ and $N$. Prove that by running the fixed share
forecaster over a set of at most $N^k$ meta-actions, and by tuning the parameter $\alpha$ in a suitable
way, a tracking regret bound of order

$$\sqrt{n \left( k \ln N + m \ln k + m \ln n \right)}$$

is achieved (Bousquet and Warmuth [40]).

5.2 Prove Corollary 5.2.


5.3 (Lower bound for tracking regret) Show that there exists a loss function such that for any
forecaster there exists a sequence of outcomes on which the tracking regret

$$\sup_{(i_1, \ldots, i_n)\,:\ \mathrm{size}(i_1, \ldots, i_n) = m} R(i_1, \ldots, i_n)$$

is lower bounded by a quantity of the order of $\sqrt{nm \ln N}$.


5.4 Prove that a slightly weaker version of the tracking regret bound established by Theo-
rem 5.3 holds when $\ell \in [0, 1]$. Hint: Fix an arbitrary compound action $(i_1, \ldots, i_n)$ with
$\mathrm{size}(i_1, \ldots, i_n) = m$ and cumulative loss $L^*$. Prove a lower bound on the sum of the weights
$w'_0(j_1, \ldots, j_n)$ of all compound actions $(j_1, \ldots, j_n)$ with cumulative loss bounded by $L^* + 2m$
(Herbster and Warmuth [159], Vovk [299]).

5.5 (Tracking experts that switch very often) In Section 5.2 we focused on tracking compound
actions that can switch no more than $m$ times, where $m \ll n$. This exercise shows that it is
possible to track the best compound action that switches almost all times. Consider the problem
of tracking the best action when $N = 2$. Construct a computationally efficient forecaster such
that, for all action sequences $i_1, \ldots, i_n$ such that $\mathrm{size}(i_1, \ldots, i_n) \ge n - m - 1$, the tracking
regret is bounded by

$$R(i_1, \ldots, i_n) \le \sqrt{\frac{n}{2} \left( (m+1) \ln 2 + m \ln \frac{e(n-1)}{m} \right)}.$$

Hint: Use the fixed share forecaster with appropriately chosen parameters.


5.6 (Exponential forecaster with average weights) In the problem of tracking the best expert,
consider the initial assignment of weights where each compound action $(i_1, \ldots, i_n)$, with $m =
\mathrm{size}(i_1, \ldots, i_n)$, gets a weight

$$w'_{0,\alpha}(i_1, \ldots, i_n) = \frac{1}{N} \left( \frac{\alpha}{N} \right)^{m} \left( \frac{\alpha}{N} + 1 - \alpha \right)^{n-m-1} \varepsilon\, \alpha^{\varepsilon - 1}$$

for each value of $\alpha \in (0, 1)$, where $\varepsilon > 0$. Using known facts on the Beta distribution (see
Section A.1.9 in the Appendix) and the lower bound $B(a, b) \ge \Gamma(a) \left( b + (a-1)_+ \right)^{-a}$ for all
$a, b > 0$, prove a bound on the tracking regret $R(i_1, \ldots, i_n)$ for the exponentially weighted
average forecaster that predicts at time $t$ based on the average weights

$$w'_t(i_1, \ldots, i_n) = \int_0^1 w'_{t,\alpha}(i_1, \ldots, i_n)\, d\alpha$$

(Vovk [299]).

5.7 (Fixed share with automatic tuning) Obtain an efficient implementation of the exponential
forecaster introduced in Exercise 5.6 by a suitable modification of the fixed share forecaster.
Hint: Each action $i$ gets an initial weight $w_{i,0}(\alpha) = (\varepsilon \alpha^{\varepsilon - 1})/N$. At time $t$, predict with

$$p_{i,t} = \frac{w_{i,t-1}}{\sum_{j=1}^{N} w_{j,t-1}}, \qquad i = 1, \ldots, N,$$

where

$$w_{i,t-1} = \int_0^1 w_{i,t-1}(\alpha)\, d\alpha \qquad \text{for } i = 1, \ldots, N.$$

Use the recursive representations of weights given in Lemma 5.4 and properties of the Beta
distribution to show that $w_{i,t}$ can be updated in time $O(Nt)$ (Vovk [299]).


5.8 (Arbitrary-depth tree experts) Prove an oracle inequality of the form


> Tp, Y:) < min È Ge (4). Y) + (EI + [leaves(E)| my) + on, 


t=1 t=1 
which holds for the randomized exponentially weighted forecaster run over the (infinite) set of
all finite tree experts. Here $c$ is a positive constant. Hint: Use the fact that the number of binary
trees with $2k + 1$ nodes is

$$a_k = \frac{1}{k} \binom{2k}{k-1} \qquad \text{for } k = 1, 2, \ldots$$

(closely related to the Catalan numbers) and that the series $a_1^{-1} + a_2^{-1} + \cdots$ is convergent.
Ignore the computability issues involved in storing and updating an infinite number of weights.


5.9 (Continued) Prove an oracle inequality of the same form as the one stated in Exercise 5.8 when
the randomized exponentially weighted forecaster is only allowed to perform a finite amount
of computation to select an action. Hint: Exploit the exponentially decreasing initial weights
to show that, at time $t$, the probability of picking a tree expert $E$ whose size $\|E\|$ is larger than
some function of $t$ is so small that it can be safely ignored irrespective of the performance of the
expert.


5.10 Show that the number of tree experts corresponding to the set of all ordered binary trees of depth
at most $D$ is $N^{2^D}$. Use this to derive a regret bound for the ordinary exponentially weighted
average forecaster over this class of experts. Compare the bound with Theorem 5.4.

5.11 (Exponential perturbation) Consider the follow-the-perturbed-leader forecaster for the short-
est path problem. Assume that the distribution of $\mathbf{Z}_t$ is such that it has independent components,
all distributed according to the two-sided exponential distribution with parameter $\eta > 0$ so that
the joint density of $\mathbf{Z}_t$ is $f(\mathbf{z}) = (\eta/2)^{|E|} e^{-\eta \|\mathbf{z}\|_1}$. Show that with a proper tuning of $\eta$, the regret
of the forecaster satisfies, with probability at least $1 - \delta$,

$$\sum_{t=1}^{n} \ell(\mathbf{I}_t, y_t) - L^* \le c \left( \sqrt{L^* K |E| \ln |E|} + K |E| \ln |E| \right) + K \sqrt{\frac{n}{2} \ln \frac{1}{\delta}},$$

where $L^* = \min_{\mathbf{i}} \sum_{t=1}^{n} \ell(\mathbf{i}, y_t) \le Kn$ and $c$ is a constant. (Note that $L^*$ may be as large as $Kn$,
in which case this bound is worse than that of Theorem 5.6.) Hint: Combine the proofs of
Theorem 5.6 and Corollary 4.5.


5.12 (Continued) Improve the bound of the previous exercise to

$$\sum_{t=1}^{n} \ell(\mathbf{I}_t, y_t) - L^* \le c \left( \sqrt{L^* |E| \ln M} + |E| \ln M \right) + K \sqrt{\frac{n}{2} \ln \frac{1}{\delta}},$$

where M is the number of all paths from u to v in the graph. Hint: Bound E max; |i - Z,,| more 
carefully. First show that for all à € (0, n) and path i, E e^?» < (2n°/(n = DoR , and then use 
the technique of Lemma A.13. 

5.13 Prove Theorem 5.9. Hint: The efficient implementation of the algorithm is obtained by combin-
ing the alternative fixed share forecaster with the techniques used to establish the computation-
ally efficient implementation of the exponentially weighted average in Section 5.4. The regret
bound is a direct application of Corollary 5.1 (György, Linder, and Lugosi [139]).


Prediction with Limited Feedback


6.1 Introduction 


This chapter investigates several variants of the randomized prediction problem. These 
variants are more difficult than the basic version treated in Chapter 4 in that the forecaster 
has only limited information about the past outcomes of the sequence to be predicted. In 
particular, after making a prediction, the true outcome $y_t$ is not necessarily revealed to the
forecaster, and a whole range of different problems can be defined depending on the type 
of information the forecaster has access to. 

One of the main messages of this chapter is that Hannan consistency may be achieved 
under significantly more restricted circumstances, a surprising fact in some of the cases 
described later. The price paid for not having full information about the outcomes is reflected 
in the deterioration of the rate at which the per-round regret approaches 0. 

In the first variant, investigated in Sections 6.2 and 6.3, only a small fraction of the 
outcomes is made available to the forecaster. Surprisingly, even in this “label efficient” 
version of the prediction game, Hannan consistency may be achieved under the only 
assumption that the number of outcomes revealed after n prediction rounds grows faster 
than $\log(n) \log\log(n)$.

Section 6.4 formulates prediction problems with limited information in a general frame- 
work. In the setup of prediction under partial monitoring, the forecaster, instead of his own 
loss, only receives a feedback signal. The difficulty of the problem depends on the relation- 
ship between losses and feedbacks. We determine general conditions under which Hannan 
consistency can be achieved. The best achievable rates of convergence are determined in 
Section 6.5, while sufficient and necessary conditions for Hannan consistency are briefly 
described in Section 6.6. 

Sections 6.7, 6.8, and 6.9 are dedicated to the multi-armed bandit problem, an impor- 
tant special case of forecasting with partial monitoring where the forecaster only gets to 
see his own loss $\ell(I_t, y_t)$ but not the loss $\ell(i, y_t)$ of the other actions $i \ne I_t$. In other
words, the feedback received by the forecaster corresponds to his own loss. Hannan con- 
sistency may be achieved by the results of Section 6.4. The issue of rates of convergence 
is somewhat more delicate, and to achieve the fastest possible rate, appropriate mod- 
ifications of the prediction method are necessary. These issues are treated in detail in 
Sections 6.8 and 6.9. 

In another variant of the prediction problem, studied in Section 6.10, the forecaster
observes the sequence of outcomes during a period whose length he decides and then
selects a single action that cannot be changed for the rest of the game.

Perhaps the main message of this chapter is that randomization is a tool of unexpected
power. All forecasters introduced in this chapter use randomization in a clever way to 
compensate for the reduction in the amount of information that is made available after each 
prediction. Surprisingly, being able to access independent uniform random variables offers 
a way of estimating this hidden information. These estimates can then be used to effectively 
construct randomized forecasters with nontrivial performance guarantees. Accordingly, the 
main technical tools build on probability theory. More specifically, martingale inequalities 
are at the heart of the analysis in many cases. 


6.2 Label Efficient Prediction 


Here we discuss a version of the sequential prediction problem in which obtaining the value
of the outcome is costly. More precisely, after choosing a guess at time $t$, the forecaster
decides whether to query the outcome (or “label”) $Y_t$. However, the number $\mu(n)$ of queries
that can be issued within a given time horizon $n$ by the forecaster is limited. Formally, the
“label efficient” prediction game is defined as follows:


LABEL EFFICIENT PREDICTION


Parameters: Number $N$ of actions, outcome space $\mathcal{Y}$, loss function $\ell$, query rate
$\mu : \mathbb{N} \to \mathbb{N}$.


For each round $t = 1, 2, \ldots$


(1) the environment chooses the next outcome $Y_t \in \mathcal{Y}$ without revealing it;

(2) the forecaster chooses an action $I_t \in \{1, \ldots, N\}$;

(3) the forecaster incurs loss $\ell(I_t, Y_t)$ and each action $i$ incurs loss $\ell(i, Y_t)$, where
none of these values is revealed to the forecaster;

(4) if fewer than $\mu(t)$ queries have been issued so far, the forecaster may issue a new
query to obtain the outcome $Y_t$; if no query is issued, then $Y_t$ remains unknown.


The goal of the forecaster is to keep the difference

$$\widehat{L}_n - \min_{i=1,\ldots,N} L_{i,n} = \sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t)$$

as small as possible regardless of the outcome sequence. We start by considering the finite-
horizon case in which the forecaster’s goal is to control the regret after $n$ predictions, where
$n$ is fixed in advance. In this restricted setup we also assume that at most $m = \mu(n)$ queries
can be issued, where $\mu$ is the query rate function. However, we do not impose any further
restriction on the distribution of these $m$ queries in the $n$ time steps; that is, $\mu(t) = m$
for all $t = 1, \ldots, n$. For this setup we define a simple forecaster whose expected regret
is bounded by $n\sqrt{2(\ln N)/m}$. Thus, if $m = n$, we recover the order of the optimal bound
of Section 4.2.

It is easy to see that in order to achieve a nontrivial performance, a forecaster must use
randomization in determining whether a label should be revealed or not. It turns out that a
simple biased coin does the job.

LABEL EFFICIENT FORECASTER
Parameters: Real numbers $\eta > 0$ and $0 \le \varepsilon \le 1$.
Initialization: $\mathbf{w}_0 = (1, \ldots, 1)$.
For each round $t = 1, 2, \ldots$


(1) draw an action $I_t$ from $\{1, \ldots, N\}$ according to the distribution $p_{i,t} =
w_{i,t-1}/(w_{1,t-1} + \cdots + w_{N,t-1})$ for $i = 1, \ldots, N$;

(2) draw a Bernoulli random variable $Z_t$ such that $\mathbb{P}[Z_t = 1] = \varepsilon$;

(3) if $Z_t = 1$, then obtain $Y_t$ and compute

$$w_{i,t} = w_{i,t-1}\, e^{-\eta\, \ell(i,Y_t)/\varepsilon} \qquad \text{for each } i = 1, \ldots, N;$$

else, let $\mathbf{w}_t = \mathbf{w}_{t-1}$.


Our label efficient forecaster uses an i.i.d. sequence $Z_1, Z_2, \ldots, Z_n$ of Bernoulli random
variables such that $\mathbb{P}[Z_t = 1] = 1 - \mathbb{P}[Z_t = 0] = \varepsilon$ and asks the label $Y_t$ to be revealed
whenever $Z_t = 1$. Here $\varepsilon > 0$ is a parameter and typically we take $\varepsilon \approx m/n$ so that the
number of solicited labels during $n$ rounds is about $m$ (note that this way the forecaster may
ask the value of more than $m$ labels, but we ignore this detail as it can be dealt with by a
simple adjustment). Because the forecasting strategy is randomized, we distinguish between
oblivious and nonoblivious opponents. In particular, following the protocol introduced in
Chapter 4, we use $Y_t$ to denote outcomes generated by a nonoblivious opponent, thus
emphasizing that the outcome $Y_t$ may depend on the past actions $I_1, \ldots, I_{t-1}$ of the
forecaster, as well as on $Z_1, \ldots, Z_{t-1}$.
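The forecaster described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name, the callable `loss(i, y)` with values in $[0, 1]$, and the injectable `rng` are hypothetical conveniences, and the outcome sequence is passed as a fixed list (an oblivious opponent). The weights are rescaled after each update purely to avoid floating-point underflow; this does not change the sampling distribution.

```python
import math
import random


def label_efficient_forecaster(outcomes, loss, n_actions, eta, eps, rng=None):
    """Run the label efficient forecaster on a fixed outcome sequence.

    Returns (actions, queried): the chosen actions I_1..I_n (0-based indices)
    and the list of rounds (0-based) in which the label was queried.
    """
    rng = rng or random.Random(0)
    w = [1.0] * n_actions                       # w_0 = (1, ..., 1)
    actions, queried = [], []
    for t, y in enumerate(outcomes):
        # (1) draw I_t with probability proportional to the current weights
        actions.append(rng.choices(range(n_actions), weights=w)[0])
        # (2) biased coin Z_t with P[Z_t = 1] = eps
        if rng.random() < eps:
            # (3) label revealed: exponential update with the loss scaled by 1/eps
            queried.append(t)
            w = [wi * math.exp(-eta * loss(i, y) / eps) for i, wi in enumerate(w)]
            top = max(w)
            w = [wi / top for wi in w]          # rescaling only, for numerical stability
        # otherwise the weights stay unchanged
    return actions, queried
```

With `eps = m / n` the number of queried rounds is binomial with mean `m`, so it can exceed `m` on a given run; the adjustment that enforces the budget with high probability is the choice of $\varepsilon$ made later in Theorem 6.2.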

The label efficient forecaster is an exponentially weighted average forecaster using the
estimated losses

$$\widetilde{\ell}(i, Y_t) =
\begin{cases}
\ell(i, Y_t)/\varepsilon & \text{if } Z_t = 1 \\
0 & \text{otherwise}
\end{cases}$$

instead of the true losses. Note that

$$\mathbb{E}_t\, \widetilde{\ell}(i, Y_t) = \mathbb{E}\left[ \widetilde{\ell}(i, Y_t) \mid I_1, Z_1, \ldots, I_{t-1}, Z_{t-1} \right] = \ell(i, Y_t),$$

where $I_1, \ldots, I_{t-1}$ is the sequence of previously chosen actions. Therefore, $\widetilde{\ell}(i, Y_t)$ may be
considered as an unbiased estimate of the true loss $\ell(i, Y_t)$. The expected performance of
this forecaster may be bounded as follows.


Theorem 6.1. Fix a time horizon $n$ and consider the label efficient forecaster run with
parameters $\varepsilon = m/n$ and $\eta = (\sqrt{2m \ln N})/n$. Then the expected number of revealed labels
equals $m$ and

$$\mathbb{E}\, \widehat{L}_n - \min_{i=1,\ldots,N} \mathbb{E}\, L_{i,n} \le n \sqrt{\frac{2 \ln N}{m}}.$$

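Before proceeding to the proof we introduce the notation

$$\widetilde{\ell}(\mathbf{p}_t, Y_t) = \sum_{i=1}^{N} p_{i,t}\, \widetilde{\ell}(i, Y_t) \qquad \text{and} \qquad \widetilde{L}_{i,n} = \sum_{t=1}^{n} \widetilde{\ell}(i, Y_t) \quad \text{for } i = 1, \ldots, N.$$

Proof. The basis of the proof is a simple modification of the proof of Theorem 2.2. For
$t \ge 0$ let $W_t = w_{1,t} + \cdots + w_{N,t}$. On the one hand,

$$\ln \frac{W_n}{W_0} = \ln \left( \sum_{i=1}^{N} e^{-\eta \widetilde{L}_{i,n}} \right) - \ln N
 \ge -\eta \min_{i=1,\ldots,N} \widetilde{L}_{i,n} - \ln N.$$

On the other hand, for each $t = 1, \ldots, n$,

$$\begin{aligned}
\ln \frac{W_t}{W_{t-1}} &= \ln \sum_{i=1}^{N} p_{i,t}\, e^{-\eta \widetilde{\ell}(i, Y_t)} \\
 &\le \ln \sum_{i=1}^{N} p_{i,t} \left( 1 - \eta\, \widetilde{\ell}(i, Y_t) + \frac{\eta^2}{2}\, \widetilde{\ell}(i, Y_t)^2 \right) \\
 &\le -\eta \sum_{i=1}^{N} p_{i,t}\, \widetilde{\ell}(i, Y_t) + \frac{\eta^2}{2} \sum_{i=1}^{N} p_{i,t}\, \widetilde{\ell}(i, Y_t)^2 \\
 &\le -\eta\, \widetilde{\ell}(\mathbf{p}_t, Y_t) + \frac{\eta^2}{2\varepsilon}\, \widetilde{\ell}(\mathbf{p}_t, Y_t),
\end{aligned}$$

where we used the facts $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$, $\ln(1 + x) \le x$ for all $x > -1$, and
$\widetilde{\ell}(i, Y_t) \in [0, 1/\varepsilon]$.

Summing this inequality over $t = 1, \ldots, n$, and using the obtained lower bound for
$\ln(W_n/W_0)$, we get, for all $i = 1, \ldots, N$,

$$\sum_{t=1}^{n} \widetilde{\ell}(\mathbf{p}_t, Y_t) - \widetilde{L}_{i,n} \le \frac{\eta}{2\varepsilon} \sum_{t=1}^{n} \widetilde{\ell}(\mathbf{p}_t, Y_t) + \frac{\ln N}{\eta}. \tag{6.1}$$

Now note that $\mathbb{E}\left[ \sum_{t=1}^{n} \widetilde{\ell}(\mathbf{p}_t, Y_t) \right] = \mathbb{E}\, \widehat{L}_n \le n$. Hence, taking the expected value of both
sides, we obtain, for all $i = 1, \ldots, N$,

$$\mathbb{E}\, \widehat{L}_n - \mathbb{E}\, L_{i,n} \le \frac{\eta n}{2\varepsilon} + \frac{\ln N}{\eta}.$$

Substituting the value $\eta = \sqrt{(2\varepsilon \ln N)/n}$ we conclude the proof. ∎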
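The final substitution can be spelled out as a short worked calculation. The function $f$ below is introduced only for this illustration; everything else uses the quantities of Theorem 6.1, with $\varepsilon = m/n$.

```latex
\begin{align*}
f(\eta) &= \frac{\eta n}{2\varepsilon} + \frac{\ln N}{\eta},
&
f'(\eta) &= \frac{n}{2\varepsilon} - \frac{\ln N}{\eta^{2}} = 0
\iff \eta = \sqrt{\frac{2\varepsilon \ln N}{n}}, \\
f\!\left(\sqrt{\frac{2\varepsilon \ln N}{n}}\right)
&= 2\sqrt{\frac{n \ln N}{2\varepsilon}}
= \sqrt{\frac{2 n \ln N}{\varepsilon}},
&
\varepsilon = \frac{m}{n}
&\;\Longrightarrow\;
\eta = \frac{\sqrt{2 m \ln N}}{n},
\quad
f(\eta) = n\sqrt{\frac{2 \ln N}{m}}.
\end{align*}
```

Both terms of $f$ are equal at the minimizer, which is why the bound takes the form of twice a single square root.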


Remark 6.1 (Switching to a new action not too often). Theorem 6.1 (and similarly Theo-
rem 6.2) can be strengthened by considering the following “lazy” forecaster.

LAZY LABEL EFFICIENT FORECASTER
Parameters: Real numbers $\eta > 0$ and $0 \le \varepsilon \le 1$.
Initialization: $\mathbf{w}_0 = (1, \ldots, 1)$, $Z_0 = 1$.
For each round $t = 1, 2, \ldots$


(1) if $Z_{t-1} = 1$, then draw an action $I_t$ from $\{1, \ldots, N\}$ according to the distribution
$p_{i,t} = w_{i,t-1}/(w_{1,t-1} + \cdots + w_{N,t-1})$, $i = 1, \ldots, N$; otherwise, let $I_t = I_{t-1}$;

(2) draw a Bernoulli random variable $Z_t$ such that $\mathbb{P}[Z_t = 1] = \varepsilon$;

(3) if $Z_t = 1$, then obtain $Y_t$ and compute

$$w_{i,t} = w_{i,t-1}\, e^{-\eta\, \ell(i,Y_t)/\varepsilon} \qquad \text{for each } i = 1, \ldots, N;$$

else, let $\mathbf{w}_t = \mathbf{w}_{t-1}$.


Note that this forecaster keeps choosing the same action as long as no new queries are issued.
It can be proven (Exercise 6.2) that the bound of Theorem 6.1 holds for the lazy forecaster
against any oblivious opponent with the additional statement that, with probability 1, the
number of changes of an action (i.e., the number of steps where $I_t \ne I_{t+1}$) is at most the
number of queried labels. (Note that Lemma 4.1 cannot be used to extend this result to
nonoblivious opponents as the lemma is only valid for forecasters that use independent
randomization at each time instant.) A similar result can be proven for Theorem 6.2
(Exercise 6.3).
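The lazy variant differs from the label efficient forecaster only in when a new action is drawn. A minimal sketch follows, under the same assumptions as before (hypothetical function name, a callable `loss(i, y)` with values in $[0, 1]$, a fixed outcome list, and an injectable `rng`); the rescaling of the weights is again only a numerical safeguard.

```python
import math
import random


def lazy_label_efficient_forecaster(outcomes, loss, n_actions, eta, eps, rng=None):
    """Lazy variant: redraw the action only in the round after a query.

    Returns (actions, n_queries) with actions as 0-based indices.
    """
    rng = rng or random.Random(0)
    w = [1.0] * n_actions        # w_0 = (1, ..., 1)
    z_prev = 1                   # Z_0 = 1, so an action is drawn in round 1
    current = None
    actions, n_queries = [], 0
    for y in outcomes:
        # (1) redraw only if the previous round issued a query
        if z_prev == 1:
            current = rng.choices(range(n_actions), weights=w)[0]
        actions.append(current)
        # (2) biased coin Z_t
        z_prev = 1 if rng.random() < eps else 0
        # (3) update the weights only when the label is revealed
        if z_prev == 1:
            n_queries += 1
            w = [wi * math.exp(-eta * loss(i, y) / eps) for i, wi in enumerate(w)]
            top = max(w)
            w = [wi / top for wi in w]
    return actions, n_queries
```

Because the action can change between rounds $t$ and $t+1$ only when $Z_t = 1$, the number of switches in `actions` never exceeds `n_queries`, on every run and not merely in expectation.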


Theorem 6.1 guarantees that the expected per-round regret converges to 0 whenever
$m \to \infty$ as $n \to \infty$. However, a bound on the expected regret gives very limited information
about the actual behavior of the regret, as it does
not exclude the possibility that the regret has enormous fluctuations around its mean. Just
note that the outcomes $Y_t$ may depend, in any complicated way, on the past randomized
actions of the forecaster, as well as on the random draws of the variables $Z_s$ for $s < t$. The
next result shows that the regret cannot exceed by much its expected value and it is, with
overwhelming probability, bounded by a quantity proportional to $n\sqrt{(\ln N)/m}$.


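Theorem 6.2. Fix a time horizon $n$ and a number $\delta \in (0, 1)$. Consider the label efficient
forecaster run with parameters

$$\varepsilon = \max\left\{ 0,\ \frac{m - \sqrt{2m \ln(4/\delta)}}{n} \right\} \qquad \text{and} \qquad \eta = \sqrt{\frac{2\varepsilon \ln N}{n}}.$$

Then, with probability at least $1 - \delta$, the number of revealed labels is at most $m$ and

$$\forall t = 1, \ldots, n \qquad \widehat{L}_t - \min_{i=1,\ldots,N} L_{i,t} \le 2n \sqrt{\frac{\ln N}{m}} + 6n \sqrt{\frac{\ln(4N/\delta)}{m}}.$$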


Before proving Theorem 6.2 note that if $\delta \le 4Ne^{-m/8}$, then the right-hand side of the
inequality is greater than $n$ and therefore the statement is trivial. Thus we may assume
throughout the proof that $\delta > 4Ne^{-m/8}$. This ensures that $\varepsilon \ge m/(2n) > 0$. We need a few
preliminary lemmas.

Lemma 6.1. The probability that the label efficient forecaster asks for more than $m$ labels
is at most $\delta/4$.


Proof. Note that the number of labels $M = Z_1 + \cdots + Z_n$ asked is binomially distributed
with parameters $n$ and $\varepsilon$. Therefore, writing $\gamma = m/n - \varepsilon = n^{-1}\sqrt{2m \ln(4/\delta)}$ and using
Bernstein’s inequality (see Corollary A.3) we obtain

$$\mathbb{P}[M > m] = \mathbb{P}[M - \mathbb{E}M > n\gamma] \le e^{-n\gamma^2/(2\varepsilon + 2\gamma/3)} \le e^{-n^2\gamma^2/(2m)} \le \frac{\delta}{4}. \qquad ∎$$
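Lemma 6.1 can be checked on concrete numbers by computing the binomial tail exactly. The sketch below is only a sanity check under illustrative parameter values chosen for this example; the function name is hypothetical, and it uses the value of $\varepsilon$ from Theorem 6.2.

```python
import math


def query_overflow_probability(n, m, delta):
    """Exact P[M > m] for M ~ Binomial(n, eps), with eps chosen as in Theorem 6.2.

    Returns (eps, tail).
    """
    eps = max(0.0, (m - math.sqrt(2 * m * math.log(4 / delta))) / n)
    tail = sum(
        math.comb(n, k) * eps ** k * (1 - eps) ** (n - k)
        for k in range(m + 1, n + 1)
    )
    return eps, tail
```

For instance, `query_overflow_probability(200, 80, 0.1)` returns a value of `eps` strictly below `m / n = 0.4` together with a tail probability that should lie below $\delta/4 = 0.025$, as the lemma guarantees. Bernstein's inequality is not tight, so the exact tail is typically far smaller than the bound.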


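The next lemma relates the “expected” loss $\bar{\ell}(\mathbf{p}_s, Y_s) = \sum_{i=1}^{N} p_{i,s}\, \ell(i, Y_s)$ to the estimated
loss $\widetilde{\ell}(\mathbf{p}_s, Y_s) = \sum_{i=1}^{N} p_{i,s}\, \widetilde{\ell}(i, Y_s)$.


Lemma 6.2. With probability at least $1 - \delta/4$, for all $t = 1, \ldots, n$,

$$\sum_{s=1}^{t} \bar{\ell}(\mathbf{p}_s, Y_s) \le \sum_{s=1}^{t} \widetilde{\ell}(\mathbf{p}_s, Y_s) + \frac{4n}{\sqrt{3}} \sqrt{\frac{1}{m} \ln \frac{4}{\delta}}.$$

Furthermore, with probability at least $1 - \delta/4$, for all $t = 1, \ldots, n$ and $i = 1, \ldots, N$,

$$\widetilde{L}_{i,t} \le L_{i,t} + \frac{4n}{\sqrt{3}} \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}}.$$

Proof. The proof in both cases is based on Bernstein’s inequality for martingales (see
Lemma A.8). To prove the first inequality, introduce the martingale $S_t = X_1 + \cdots + X_t$,
where $X_s = \bar{\ell}(\mathbf{p}_s, Y_s) - \widetilde{\ell}(\mathbf{p}_s, Y_s)$, $s = 1, \ldots, n$, is a martingale difference sequence with
respect to the sequence $(I_t, Z_t)$. For all $t = 1, \ldots, n$, define

$$\Sigma_t^2 = \sum_{s=1}^{t} \mathbb{E}\left[ X_s^2 \mid Z^{s-1}, I^{s-1} \right].$$

Now note that, for all $s = 1, \ldots, n$,

$$\mathbb{E}\left[ X_s^2 \mid Z^{s-1}, I^{s-1} \right]
 = \mathbb{E}\left[ \left( \bar{\ell}(\mathbf{p}_s, Y_s) - \widetilde{\ell}(\mathbf{p}_s, Y_s) \right)^2 \mid Z^{s-1}, I^{s-1} \right]
 \le \mathbb{E}\left[ \widetilde{\ell}(\mathbf{p}_s, Y_s)^2 \mid Z^{s-1}, I^{s-1} \right] \le 1/\varepsilon,$$

which implies that $\Sigma_t^2 \le n/\varepsilon$ for all $t = 1, \ldots, n$. We now apply Lemma A.8, noting that
$\max_t |X_t| \le 1/\varepsilon$ with probability 1. For $u > 0$ this yields

$$\mathbb{P}\left[ \max_{t=1,\ldots,n} S_t > u \right]
 = \mathbb{P}\left[ \max_{t=1,\ldots,n} S_t > u \ \text{and}\ \Sigma_n^2 \le \frac{n}{\varepsilon} \right]
 \le \exp\left( -\frac{u^2}{2\left( n/\varepsilon + u/(3\varepsilon) \right)} \right).$$

Choosing $u = (4/\sqrt{3})\, n \sqrt{(1/m) \ln(4/\delta)}$, and using $\ln(4/\delta) \le m/8$ implied by the assump-
tion $\delta > 4Ne^{-m/8}$, we see that $u \le n$. This, combined with $\varepsilon \ge m/(2n)$, shows that

$$\frac{u^2}{2\left( n/\varepsilon + u/(3\varepsilon) \right)} \ge \frac{u^2}{(8/3)\, n/\varepsilon} \ge \frac{3u^2 m}{16 n^2} = \ln \frac{4}{\delta},$$

thus proving the first inequality.

To prove the second inequality note that, as just seen, for each fixed $i$, we have

$$\mathbb{P}\left[ \max_{t=1,\ldots,n} \left( \widetilde{L}_{i,t} - L_{i,t} \right) > \frac{4n}{\sqrt{3}} \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}} \right] \le \frac{\delta}{4N}.$$

An application of the union-of-events bound to $i = 1, \ldots, N$ concludes the proof. ∎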


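Proof of Theorem 6.2. When $m \le \ln N$, the bound of the theorem is trivial. Therefore,
we only need to consider the case $m > \ln N$. In this case $\varepsilon \ge m/(2n)$ implies that $1 -
\eta/(2\varepsilon) \ge 0$. Thus, a straightforward combination of Lemmas 6.1 and 6.2 with (6.1) shows
that, with probability at least $1 - 3\delta/4$, the following events simultaneously hold (where
$\bar{L}_t = \sum_{s=1}^{t} \bar{\ell}(\mathbf{p}_s, Y_s)$):


1. the strategy asks for at most $m$ labels;
2. for all $t = 1, \ldots, n$,

$$\left( 1 - \frac{\eta}{2\varepsilon} \right) \bar{L}_t - \min_{i=1,\ldots,N} L_{i,t} \le \frac{8}{\sqrt{3}}\, n \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}} + \frac{\ln N}{\eta}.$$

Since $\bar{L}_t \le n$ for all $t \le n$, the inequality implies that for all $t = 1, \ldots, n$

$$\bar{L}_t - \min_{i=1,\ldots,N} L_{i,t}
 \le \frac{\eta n}{2\varepsilon} + \frac{8}{\sqrt{3}}\, n \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}} + \frac{\ln N}{\eta}
 \le 2n \sqrt{\frac{\ln N}{m}} + \frac{8}{\sqrt{3}}\, n \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}}$$

by our choice of $\eta$, and using $1/(2\varepsilon) \le n/m$ derived from $\varepsilon \ge m/(2n)$. The proof is
finished by noting that the Hoeffding-Azuma inequality (see Lemma A.7) implies that,
with probability at least $1 - \delta/4$, for all $t = 1, \ldots, n$,

$$\widehat{L}_t \le \bar{L}_t + \sqrt{\frac{n}{2} \ln \frac{4}{\delta}} \le \bar{L}_t + n \sqrt{\frac{1}{2m} \ln \frac{4N}{\delta}},$$

where we used $m \le n$ in the last step. ∎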


Theorem 6.1 does not directly imply Hannan consistency of the associated forecast-
ing strategy because the regret bound does not hold uniformly over the sequence length
$n$. However, using standard dynamical tuning techniques (such as the “doubling trick”
described in Chapter 2) Hannan consistency can be achieved. The main quantity that
arises in the analysis is the query rate $\mu(n)$, which is the number of queries that can be
issued up to time $n$. The next result shows that Hannan consistency is achievable whenever
$\mu(n)/(\log(n) \log\log(n)) \to \infty$.


Corollary 6.1. Let $\mu : \mathbb{N} \to \mathbb{N}$ be any nondecreasing integer-valued function such that

$$\lim_{n \to \infty} \frac{\mu(n)}{\log_2(n) \log_2 \log_2(n)} = \infty.$$

Then there exists a Hannan consistent randomized label efficient forecaster that issues at
most $\mu(n)$ queries in the first $n$ predictions for any $n \in \mathbb{N}$.


Proof. The algorithm we consider divides time into consecutive epochs of increasing
lengths $n_r = 2^r$ for $r = 0, 1, 2, \ldots$. In the $r$th epoch (of length $2^r$) the algorithm runs the
forecaster of Theorem 6.2 with parameters $n = 2^r$, $m = m_r$, and $\delta_r = 1/(1 + r)^2$, where $m_r$
will be determined by the analysis (without loss of generality, we assume that the forecaster
always asks for at most $m_r$ labels in each epoch $r$). Our choice of $\delta_r$ and the Borel-Cantelli
lemma implies that the bound of Theorem 6.2 holds for all but finitely many epochs. Denote
the (random) index of the last epoch in which the bound does not hold by $\widehat{R}$. Let $L^{*(r)}$ be the
cumulative loss of the best action in epoch $r$ and let $\widehat{L}^{(r)}$ be the cumulative loss of the
forecaster in the same epoch. Introduce $R(n) = \lfloor \log_2 n \rfloor$. Then, by Theorem 6.2 and by
definition of $\widehat{R}$, for each $n$ and for each realization of $I^n$ and $Z^n$, we have

$$\begin{aligned}
\widehat{L}_n - L_n^* &\le \sum_{r=0}^{R(n)-1} \left( \widehat{L}^{(r)} - L^{*(r)} \right)
 + \sum_{t=2^{R(n)}}^{n} \ell(I_t, Y_t) - \min_{j=1,\ldots,N} \sum_{t=2^{R(n)}}^{n} \ell(j, Y_t) \\
 &\le \sum_{r=0}^{\widehat{R}} 2^r + 8 \sum_{r=\widehat{R}+1}^{R(n)} 2^r \sqrt{\frac{\ln\left( 4N(r+1)^2 \right)}{m_r}}.
\end{aligned}$$

This, the finiteness of $\widehat{R}$ and $1/n \le 2^{-R(n)}$ imply that, with probability 1,

$$\limsup_{n \to \infty} \frac{\widehat{L}_n - L_n^*}{n} \le 8 \limsup_{R \to \infty} 2^{-R} \sum_{r=0}^{R} 2^r \sqrt{\frac{\ln\left( 4N(r+1)^2 \right)}{m_r}}.$$

Cesaro's lemma ensures that this lim sup equals 0 as soon as $m_r/\ln r \to +\infty$. It remains to
see that the latter condition is satisfied under the additional requirement that the forecaster
does not issue more than $u(n)$ queries up to time $n$. This is guaranteed whenever
$m_0 + m_1 + \cdots + m_{R(n)} \le u(n)$ for each $n$. Denote by $\phi$ the largest nondecreasing function such
that

$$\phi(t) \le \frac{u(t)}{(1 + \log_2 t)\,\log_2(1 + \log_2 t)} \qquad \text{for all } t = 1, 2, \ldots.$$

As $u$ grows faster than $\log_2(n) \log_2\log_2(n)$, we have that $\phi(t) \to +\infty$. Thus, choosing $m_0 = 0$
and $m_r = \lfloor \phi(2^r) \log_2(1 + r) \rfloor$, we indeed ensure that $m_r/\ln r \to +\infty$. Furthermore,
considering that $m_r$ is nondecreasing as a function of $r$, and using the monotonicity of $\phi$,

$$\begin{aligned}
\sum_{r=0}^{R(n)} m_r &\le (R(n) + 1)\,\phi\left(2^{R(n)}\right) \log_2(1 + R(n)) \\
&\le (1 + \log_2 n)\,\phi(n)\,\log_2(1 + \log_2 n) \le u(n).
\end{aligned}$$

This concludes the proof. □
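The epoch structure used in the proof can be sketched in a few lines. This is only an illustration, assuming rounds are numbered from 1 so that epoch $r$ covers rounds $2^r, \ldots, 2^{r+1}-1$; the helper names `epoch_index` and `epoch_parameters` are hypothetical, and the choice of the budgets $m_r$ is left out.

```python
import math

def epoch_index(t):
    """Epoch r containing round t >= 1, when epoch r covers rounds 2**r .. 2**(r+1) - 1."""
    # For a positive integer t, bit_length() - 1 equals floor(log2(t)) exactly.
    return t.bit_length() - 1

def epoch_parameters(r):
    """Length 2**r and confidence level delta_r = 1 / (1 + r)**2 of epoch r."""
    return 2 ** r, 1.0 / (1 + r) ** 2

# The confidence levels are summable (their sum is below pi**2 / 6),
# which is the property the Borel-Cantelli step relies on.
partial_sum = sum(epoch_parameters(r)[1] for r in range(10_000))
```

With this numbering, `epoch_index(n)` coincides with $R(n) = \lfloor \log_2 n \rfloor$, the index of the (possibly incomplete) epoch containing round $n$.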


We also note here that in case the cumulative loss of the best action $L_n^* = \min_{i=1,\ldots,N} L_{i,n}$
is small, Theorem 6.2 may be improved by a more careful analysis. In particular, an upper
bound of the order

$$\sqrt{\frac{n L_n^* \ln(Nn)}{m}} + \frac{n \ln(Nn)}{m}$$

holds. Thus, when $L_n^* = 0$, the rate of convergence becomes significantly faster. For the
details we refer the reader to Exercise 6.4.




6.3 Lower Bounds 


Here we show that the performance bounds proved in the previous section for the label
efficient exponentially weighted average forecaster are essentially unimprovable in the
strong sense that no other label efficient forecasting strategy can have a significantly better
performance for all problems. The main result of the previous section is that there exists a
randomized forecasting algorithm such that its regret is bounded by a quantity of the order
of $n\sqrt{\ln N/m}$ if $m$ is the maximum number of revealed labels during the $n$ rounds. The
next result shows that for all possible forecasters asking for at most $m$ labels, there exists a
prediction problem such that the expected excess cumulative loss is at least a constant times
$n\sqrt{\ln N/m}$. In other words, the best achievable per-round regret is of the order $\sqrt{\ln N/m}$.
Interestingly, this does not depend on the number $n$ of rounds but only on the number $m$ of
revealed labels. Recall that in the “full information” case of Chapter 4 the best achievable
per-round regret has the form $\sqrt{\ln N/n}$. Intuitively, $m$ plays the role of the “sample size”
in the problem of label efficient prediction.

To present the main ideas in a transparent way, first we assume that $N = 2$; that is,
the forecaster has two actions to choose from at each time instance, and the outcome
space is binary as well, for example, $\mathcal{Y} = \{0, 1\}$. The general result for $N > 2$ is shown
subsequently. Consider the simple zero-one loss function given by $\ell(i, y) = \mathbb{I}_{\{i \neq y\}}$. Then
we have the following lower bound.


Theorem 6.3. Let $n \ge m$ and consider the setup described earlier. Then for any, possibly
randomized, prediction strategy that asks for at most $m$ labels,

$$\sup_{y^n \in \{0,1\}^n} \left( \mathbb{E}\,\widehat{L}_n - \min_{i=0,1} L_{i,n} \right) \ge 0.14\,\frac{n}{\sqrt{m}},$$

where the expectation is taken with respect to the randomization of the forecaster.


Proof. As is usual in proofs of lower bounds, we construct a random outcome sequence
and show that the expected value (with respect to the random choice of the outcome
sequence) of the excess loss $\widehat{L}_n - \min_{i=0,1} L_{i,n}$ is bounded from below by $0.14\,n/\sqrt{m}$ for
any prediction strategy, randomized or not. This obviously implies the theorem, because the
expected value can never exceed the supremum.

To this end, we define the outcome sequence $Y_1, \ldots, Y_n$ as a sequence of random variables
whose joint distribution is defined as follows. Let $\sigma$ be a symmetric Bernoulli random
variable, $\mathbb{P}[\sigma = 0] = \mathbb{P}[\sigma = 1] = 1/2$. Given the event $\sigma = 1$, the $Y_t$'s are conditionally
independent with distribution

$$\mathbb{P}[Y_t = 1 \mid \sigma = 1] = 1 - \mathbb{P}[Y_t = 0 \mid \sigma = 1] = \frac{1}{2} + \varepsilon,$$

where $\varepsilon$ is a positive number specified later. Given $\sigma = 0$, the $Y_t$'s are also conditionally
independent with

$$\mathbb{P}[Y_t = 1 \mid \sigma = 0] = 1 - \mathbb{P}[Y_t = 0 \mid \sigma = 0] = \frac{1}{2} - \varepsilon.$$

Now it suffices to establish a lower bound for $\mathbb{E}\left[\widehat{L}_n - \min_{i=0,1} L_{i,n}\right]$ when the outcome
sequence is $Y_1, \ldots, Y_n$ and the expected value is taken with respect to both the randomization
of the forecaster and the distribution of the $Y_t$'s. Note that we may assume at this
point that the prediction strategy is deterministic. Since any randomized strategy may be
regarded as a randomized choice from a class of deterministic strategies, if the lower bound
is true for any deterministic strategy, by Fubini's theorem it must also hold for the average
over all deterministic strategies according to the randomization. First note that

$$\begin{aligned}
\mathbb{E} \min\{L_{0,n}, L_{1,n}\}
&= \frac{1}{2}\Big( \mathbb{E}\big[\min\{L_{0,n}, L_{1,n}\} \mid \sigma = 1\big] + \mathbb{E}\big[\min\{L_{0,n}, L_{1,n}\} \mid \sigma = 0\big] \Big) \\
&\le \frac{1}{2}\Big( \min\big\{\mathbb{E}[L_{0,n} \mid \sigma = 1],\, \mathbb{E}[L_{1,n} \mid \sigma = 1]\big\} + \min\big\{\mathbb{E}[L_{0,n} \mid \sigma = 0],\, \mathbb{E}[L_{1,n} \mid \sigma = 0]\big\} \Big) \\
&= n\left(\frac{1}{2} - \varepsilon\right),
\end{aligned}$$

and therefore

$$\mathbb{E}\left[\widehat{L}_n - \min_{i=0,1} L_{i,n}\right] \ge \mathbb{E}\,\widehat{L}_n - n\left(\frac{1}{2} - \varepsilon\right).$$

It remains to prove a lower bound for $\mathbb{E}\,\widehat{L}_n = \sum_{t=1}^n \mathbb{E}\,\ell(I_t, Y_t)$, where $I_t$ is the action chosen
by the predictor at time $t$. Denote by $1 \le T_1 < \cdots < T_K \le n$ the prediction rounds when the
forecaster queried a label, and let $Z_1 = Y_{T_1}, \ldots, Z_K = Y_{T_K}$ be the revealed labels. Here the
random variable $K$ denotes the number of labels queried by the forecaster in $n$ rounds. By
the label efficient assumption, $K \le m$. Recall that we consider only deterministic prediction
strategies. Since the predictions of the forecaster can only depend on the labels revealed
up to that time, such a strategy may be formalized by a function $g_t : \{0, 1, \star\}^m \to \{0, 1\}$ so
that at time $t$ the forecaster outputs

$$I_t = g_t\big(Z_1, \ldots, Z_K, \underbrace{\star, \ldots, \star}_{m-K \text{ times}}\big).$$

Defining the random vector $Z = (Z_1, \ldots, Z_K)$, by a slight abuse of notation we may simply
write

$$I_t = g_t(Z).$$

The expected loss at time $t$ is then

$$\mathbb{E}\,\ell(I_t, Y_t) = \mathbb{P}[g_t(Z) \neq Y_t] = \sum_{k=1}^{m} \mathbb{P}[K = k]\;\mathbb{P}[g_t(Z) \neq Y_t \mid K = k].$$


To simplify notation, in the sequel we use

$$\mathbb{P}_k[\,\cdot\,] = \mathbb{P}[\,\cdot \mid K = k] \qquad \text{and} \qquad \mathbb{E}_k[\,\cdot\,] = \mathbb{E}[\,\cdot \mid K = k]$$

to denote the conditional distribution and expectation given that exactly $k$ labels have been
revealed up to time $t$. Since, for any $t$, the event $\{T_i = t\}$ is determined by the values taken
by $Y_1, \ldots, Y_t$, a well-known fact of probability theory (see, e.g., Chow and Teicher [59,
Lemma 2, p. 138]) implies that $Z_1 = Y_{T_1}, \ldots, Z_K = Y_{T_K}$ are independent and identically
distributed, with the same distribution as $Y_1$. It follows by a basic identity of the theory of
binary classification (see Lemma A.15), by taking $Y_t$ as $Y$ and the pair $(Z, \sigma)$ as $X$, that

$$\mathbb{P}_k[g_t(Z) \neq Y_t] = \frac{1}{2} - \varepsilon + (2\varepsilon)\,\mathbb{P}_k[g_t(Z) \neq \sigma].$$


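Thus, it suffices to obtain a lower bound for $\mathbb{P}_k[g_t(Z) \neq \sigma]$. By Lemmas A.14 and A.15,
this probability is at least

$$\mathbb{E}_k \min\big\{ \mathbb{P}_k[\sigma = 0 \mid Z],\; \mathbb{P}_k[\sigma = 1 \mid Z] \big\}.$$

Observe that for any $z \in \{0, 1\}^k$,

$$\mathbb{P}_k[\sigma = 0 \mid Z = z] = \frac{\mathbb{P}_k[Z = z \mid \sigma = 0]\,\mathbb{P}_k[\sigma = 0]}{\mathbb{P}_k[Z = z]} = \frac{(1/2 + \varepsilon)^{n_0} (1/2 - \varepsilon)^{n_1}}{2\,\mathbb{P}_k[Z = z]},$$

where $n_0$ and $n_1$ denote the number of 0's and 1's in the string $z \in \{0, 1\}^k$. Similarly,

$$\mathbb{P}_k[\sigma = 1 \mid Z = z] = \frac{(1/2 - \varepsilon)^{n_0} (1/2 + \varepsilon)^{n_1}}{2\,\mathbb{P}_k[Z = z]},$$

and therefore

$$\begin{aligned}
\mathbb{P}_k[g_t(Z) \neq \sigma]
&\ge \mathbb{E}_k \min\big\{ \mathbb{P}_k[\sigma = 0 \mid Z],\; \mathbb{P}_k[\sigma = 1 \mid Z] \big\} \\
&= \sum_{z \in \{0,1\}^k} \mathbb{P}_k[Z = z] \min\big\{ \mathbb{P}_k[\sigma = 0 \mid Z = z],\; \mathbb{P}_k[\sigma = 1 \mid Z = z] \big\} \\
&= \frac{1}{2} \sum_{z \in \{0,1\}^k} \min\big\{ (1/2 - \varepsilon)^{n_0} (1/2 + \varepsilon)^{n_1},\; (1/2 - \varepsilon)^{n_1} (1/2 + \varepsilon)^{n_0} \big\} \\
&= \frac{1}{2} \sum_{j=0}^{k} \binom{k}{j} \min\big\{ (1/2 - \varepsilon)^{j} (1/2 + \varepsilon)^{k-j},\; (1/2 - \varepsilon)^{k-j} (1/2 + \varepsilon)^{j} \big\} \\
&\ge \frac{1}{2} \sum_{j=\lceil k/2 \rceil}^{k} \binom{k}{j} (1/2 - \varepsilon)^{j} (1/2 + \varepsilon)^{k-j} \\
&= \frac{1}{2}\,\mathbb{P}\left[ B \ge \frac{k}{2} \right],
\end{aligned}$$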


where $B$ is a binomial random variable with parameters $k$ and $1/2 - \varepsilon$. By the central limit
theorem this probability may be bounded from below by a constant if $\varepsilon$ is at most of the
order $1/\sqrt{k}$. Since $k \le m$, this may be achieved by taking, for example, $\varepsilon = 1/(4\sqrt{m})$.

To get a nonasymptotic bound, we use Slud's inequality (see Lemma A.10) implying
that

$$\mathbb{P}\left[ B \ge \frac{k}{2} \right] \ge 1 - \Phi\left( \frac{\varepsilon\sqrt{k}}{\sqrt{(1/2 - \varepsilon)(1/2 + \varepsilon)}} \right) \ge 1 - \Phi\left( \frac{1/4}{\sqrt{1/4 - 1/(16m)}} \right) \ge 1 - \Phi(0.577) \ge 0.28,$$

where $\Phi$ is the distribution function of a standard normal random variable. Here we used
the fact that $k \le m$.
In summary, we have proved that

$$\begin{aligned}
\mathbb{E}\left[\widehat{L}_n - \min_{i=0,1} L_{i,n}\right]
&\ge \mathbb{E}\,\widehat{L}_n - n\left(\frac{1}{2} - \varepsilon\right) \\
&= \sum_{t=1}^{n} \sum_{k=1}^{m} \mathbb{P}[K = k]\;\mathbb{P}_k[g_t(Z) \neq Y_t] - n\left(\frac{1}{2} - \varepsilon\right) \\
&= 2\varepsilon \sum_{t=1}^{n} \sum_{k=1}^{m} \mathbb{P}[K = k]\;\mathbb{P}_k[g_t(Z) \neq \sigma] \\
&\ge 2n\varepsilon \times 0.28 = 0.14\,\frac{n}{\sqrt{m}},
\end{aligned}$$

as desired. □
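The Slud step can be checked numerically. The sketch below is only an illustration, not part of the proof: it compares the exact binomial tail with the normal lower bound for $\varepsilon = 1/(4\sqrt{m})$, and it restricts attention to even $k$ so that $k/2$ is an integer threshold. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Distribution function of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binomial_upper_tail(k, p, b):
    """Exact P[B >= b] for B ~ Binomial(k, p) and an integer threshold b."""
    return sum(math.comb(k, j) * p ** j * (1 - p) ** (k - j) for j in range(b, k + 1))

def slud_lower_bound(k, eps):
    """Normal lower bound on P[B >= k/2] for B ~ Binomial(k, 1/2 - eps)."""
    z = eps * math.sqrt(k) / math.sqrt((0.5 - eps) * (0.5 + eps))
    return 1.0 - normal_cdf(z)

def smallest_tail(m):
    """Smallest exact tail P[B >= k/2] over even k <= m, with eps = 1 / (4 sqrt(m))."""
    eps = 1.0 / (4.0 * math.sqrt(m))
    return min(binomial_upper_tail(k, 0.5 - eps, k // 2) for k in range(2, m + 1, 2))
```

For instance, `smallest_tail(100)` stays above the constant 0.28 used in the proof, and the normal bound `slud_lower_bound(k, eps)` sits below the exact tail for each even `k`.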


Now we extend Theorem 6.3 to the general case of N > 2 actions. The next result shows 
that, as suggested by the upper bounds of Theorems 6.1 and 6.2, the best achievable regret 
is proportional to the square root of the logarithm of the number of actions. 


Theorem 6.4. Assume that $N$ is an integer power of 2. There exists a loss function $\ell$
such that for all sets of $N$ actions and for all $n \ge m \ge 4(\sqrt{3} - 1)^2 \log_2 N$, the expected
cumulative loss of any forecaster that asks for at most $m$ labels while predicting a binary
sequence of $n$ outcomes satisfies the inequality

$$\sup_{y^n \in \{0,1\}^n} \left( \mathbb{E}\,\widehat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \right) \ge c\,n\sqrt{\frac{\log_2 N}{m}},$$

where $c = \big((\sqrt{3} - 1)/8\big)\,e^{-\sqrt{3}/2} > 0.0384$.


Proof. The way to proceed is similar to Theorem 6.3, though the randomizing distribution 
of the outcomes and the losses needs to be chosen more carefully. The main idea is to 
consider a certain family of prediction problems defined by sequences of outcomes and 
loss functions and then show that for any forecaster there exists a prediction problem in the 
class such that the regret of the forecaster is at least as large as announced. 

The family is parameterized by three sequences: the “side information” sequence
$x = (x_1, \ldots, x_n)$ with

$$x_1, \ldots, x_n \in \{1, 2, \ldots, \log_2 N\},$$

the binary vector $\sigma = (\sigma_1, \ldots, \sigma_{\log_2 N}) \in \{0, 1\}^{\log_2 N}$, and the sequence $u$ of real numbers
$u_1, \ldots, u_n \in [0, 1]$. For any fixed value of these parameters, define the sequence of outcomes by

$$y_t = \begin{cases} 1 & \text{if } u_t \le \frac{1}{2} + (2\sigma_{x_t} - 1)\varepsilon \\ 0 & \text{otherwise,} \end{cases}$$

where $\varepsilon > 0$ is a parameter whose value will be set later. The $N$ actions correspond to
the binary vectors $b = (b_1, \ldots, b_{\log_2 N}) \in \{0, 1\}^{\log_2 N}$ such that the loss of action $b$ at time
$t$ is $\ell(b, y_t) = \mathbb{I}_{\{b_{x_t} \neq y_t\}}$. For any forecaster that asks for at most $m$ labels, we establish a
lower bound for the maximal regret within the family of problems by choosing the problem
randomly and lower bounding the expected regret. To this end, we choose the parameters
$x$, $\sigma$, and $u$ randomly such that $X_1, \ldots, X_n$ are independent and uniformly distributed
over $\{1, \ldots, \log_2 N\}$, the $U_i$ are also i.i.d. uniformly distributed over $[0, 1]$, and the $\sigma_i$ are
independent symmetric Bernoulli random variables. Note that with this choice, given $X_t$
and $\sigma$, the conditional probability of $Y_t = 1$ is $1/2 + \varepsilon$ if $\sigma_{X_t} = 1$ and $1/2 - \varepsilon$ if $\sigma_{X_t} = 0$.
Now we have

$$\sup_{x, \sigma, u} \left( \mathbb{E}\,\widehat{L}_n - \min_{b} L_{b,n} \right) \ge \mathbb{E}\left( \widehat{L}_n - \min_{b} L_{b,n} \right),$$

where the expected value on the right-hand side is taken with respect to all randomizations
introduced earlier, as well as the randomization the forecaster may use.
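Clearly, for each fixed value of $\sigma$,

$$\mathbb{E}_{X,U} \min_{b} L_{b,n} \le \min_{b} \mathbb{E}_{X,U}\,L_{b,n} = n\left(\frac{1}{2} - \varepsilon\right),$$

where $\mathbb{E}_{X,U}$ denotes expectation with respect to $X$ and $U$ (i.e., conditioned on $\sigma$). Thus, if
$I_t$ denotes the forecaster's decision at time $t$,

$$\mathbb{E}\left( \widehat{L}_n - \min_{b} L_{b,n} \right) \ge \sum_{t=1}^{n} \left( \mathbb{P}[I_t \neq Y_t] - \left(\frac{1}{2} - \varepsilon\right) \right).$$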


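By the same argument as in the proof of Theorem 6.3 it suffices to consider forecasters
of the form $I_t = g_t(Z)$, where, as before, $g_t$ is an arbitrary $\{0, 1\}$-valued function and
$Z = (Y_{T_1}, \ldots, Y_{T_K})$ is the vector of outcomes at the times in which the forecaster asked for
labels. To derive a lower bound for

$$\begin{aligned}
&\sum_{t=1}^{n} \inf_{g_t} \left( \sum_{k=1}^{m} \mathbb{P}[K = k]\;\mathbb{P}_k[g_t(Z) \neq Y_t] - \left(\frac{1}{2} - \varepsilon\right) \right) \\
&\qquad \ge \sum_{t=1}^{n} \sum_{k=1}^{m} \mathbb{P}[K = k] \left( \inf_{g_t} \mathbb{P}_k[g_t(Z) \neq Y_t] - \left(\frac{1}{2} - \varepsilon\right) \right),
\end{aligned}$$

note that for each $t$,

$$\inf_{g_t} \mathbb{P}_k[g_t(Z) \neq Y_t] \ge \inf_{g_t'} \mathbb{P}_k[g_t'(Z') \neq Y_t],$$

where $Z' = (X_{T_1}, Y_{T_1}, \ldots, X_{T_K}, Y_{T_K}, X_t)$ is the vector of revealed labels augmented by the
corresponding side information plus the actual side information $X_t$. Now $g_t'$ can be any
binary-valued function of $Z'$.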

Consider the classification problem of guessing $Y_t$ from the pair $Z', \sigma$ (see Section A.3
for the basic notions). The Bayes decision is clearly $g^*(Z', \sigma) = \sigma_{X_t}$, whose probability of
error is $1/2 - \varepsilon$. Thus, by Lemma A.15,

$$\mathbb{P}_k[g_t'(Z') \neq Y_t] - \left(\frac{1}{2} - \varepsilon\right) = (2\varepsilon)\,\mathbb{P}_k[g_t'(Z') \neq \sigma_{X_t}],$$

where we used the fact that $\mathbb{P}_k[Y_t = 1 \mid Z', \sigma] = 1/2 + \varepsilon(2\sigma_{X_t} - 1)$. To bound
$\mathbb{P}_k[g_t'(Z') \neq \sigma_{X_t}]$ we use Lemmas A.14 and A.15. Clearly, this probability may be bounded from below
by the Bayes probability of error $L^*(Z', \sigma_{X_t})$ of guessing the value of $\sigma_{X_t}$ from $Z'$.




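All we have to do is to find a suitable lower bound for

$$\mathbb{P}_k[g_t'(Z') \neq \sigma_{X_t}] \ge L^*(Z', \sigma_{X_t}) = \mathbb{E}_k \min\big\{ \mathbb{P}_k[\sigma_{X_t} = 1 \mid Z'],\; \mathbb{P}_k[\sigma_{X_t} = 0 \mid Z'] \big\}.$$

Observe that

$$\mathbb{P}_k[\sigma_{X_t} = 1 \mid Z'] = \begin{cases} 1/2 & \text{if } X_t \neq X_{T_1}, \ldots, X_t \neq X_{T_K} \\ \mathbb{P}_k[\sigma_{X_t} = 1 \mid Y_{i_1}, \ldots, Y_{i_j}] & \text{otherwise,} \end{cases}$$

where $i_1, \ldots, i_j$ are those indices among $T_1, \ldots, T_K$ for which $X_{i_1} = \cdots = X_{i_j} = X_t$.
Using $i$ to denote the value of $X_t$ in $Z'$, we compute

$$\mathbb{P}_k[\sigma_i = 1 \mid Y_{i_1} = y_1, \ldots, Y_{i_j} = y_j]$$

for $y_1, \ldots, y_j \in \{0, 1\}$. Denoting with $j_0$ and $j_1$ the numbers of 0's and 1's in $y_1, \ldots, y_j$,
we see that

$$\mathbb{P}_k[\sigma_i = 1 \mid Y_{i_1} = y_1, \ldots, Y_{i_j} = y_j] = \frac{(1 - 2\varepsilon)^{j_0} (1 + 2\varepsilon)^{j_1}}{(1 - 2\varepsilon)^{j_0} (1 + 2\varepsilon)^{j_1} + (1 + 2\varepsilon)^{j_0} (1 - 2\varepsilon)^{j_1}}.$$

Therefore, if $X_t = X_{i_1} = \cdots = X_{i_j} = i$, then

$$\begin{aligned}
\min\big\{ \mathbb{P}_k[\sigma_{X_t} = 1 \mid Z'],\; \mathbb{P}_k[\sigma_{X_t} = 0 \mid Z'] \big\}
&= \frac{\min\big\{ (1 - 2\varepsilon)^{j_0} (1 + 2\varepsilon)^{j_1},\; (1 + 2\varepsilon)^{j_0} (1 - 2\varepsilon)^{j_1} \big\}}{(1 - 2\varepsilon)^{j_0} (1 + 2\varepsilon)^{j_1} + (1 + 2\varepsilon)^{j_0} (1 - 2\varepsilon)^{j_1}} \\
&= \frac{\min\left\{ 1,\; \left(\frac{1 + 2\varepsilon}{1 - 2\varepsilon}\right)^{j_1 - j_0} \right\}}{1 + \left(\frac{1 + 2\varepsilon}{1 - 2\varepsilon}\right)^{j_1 - j_0}} \\
&= \frac{1}{1 + \left(\frac{1 + 2\varepsilon}{1 - 2\varepsilon}\right)^{|j_1 - j_0|}}.
\end{aligned}$$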


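In summary, denoting $\alpha = (1 + 2\varepsilon)/(1 - 2\varepsilon)$, we have

$$\begin{aligned}
L^*(Z', \sigma_{X_t})
&= \mathbb{E}_k \left[ \frac{1}{1 + \alpha^{\left| \sum_{l : X_{T_l} = X_t} (2Y_{T_l} - 1) \right|}} \right] \\
&\ge \mathbb{E}_k \left[ \frac{1}{2\,\alpha^{\left| \sum_{l : X_{T_l} = X_t} (2Y_{T_l} - 1) \right|}} \right] \\
&= \frac{1}{2}\,\mathbb{E}_k \left[ \alpha^{-\left| \sum_{l : X_{T_l} = X_t} (2Y_{T_l} - 1) \right|} \right] \\
&\ge \frac{1}{2}\,\alpha^{-\mathbb{E}_k \left| \sum_{l : X_{T_l} = X_t} (2Y_{T_l} - 1) \right|},
\end{aligned}$$

where the last step follows from Jensen's inequality.
Next we bound $\mathbb{E}_k \left| \sum_{l : X_{T_l} = 1} (2Y_{T_l} - 1) \right|$. Clearly, if $B(j, q)$ denotes a binomial random
variable with parameters $j$ and $q$ and $\mathbb{E}_{k, \sigma_1 = y}$ denotes $\mathbb{E}_k$ further conditioned on $\sigma_1 = y$,

$$\begin{aligned}
\mathbb{E}_k \left| \sum_{l : X_{T_l} = 1} (2Y_{T_l} - 1) \right|
&= \mathbb{P}[\sigma_1 = 0]\;\mathbb{E}_{k, \sigma_1 = 0} \left| \sum_{l : X_{T_l} = 1} (2Y_{T_l} - 1) \right| + \mathbb{P}[\sigma_1 = 1]\;\mathbb{E}_{k, \sigma_1 = 1} \left| \sum_{l : X_{T_l} = 1} (2Y_{T_l} - 1) \right| \\
&= \mathbb{E}_{k, \sigma_1 = 0} \left| \sum_{l : X_{T_l} = 1} (2Y_{T_l} - 1) \right| \qquad \text{(by symmetry)} \\
&= \sum_{j=0}^{k} \binom{k}{j} p^j (1 - p)^{k-j}\;\mathbb{E}\,\big| 2B(j, 1/2 - \varepsilon) - j \big|,
\end{aligned}$$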


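where we write $p = 1/\log_2 N$. However, by straightforward calculation we see that

$$\begin{aligned}
\mathbb{E}\,\big| 2B(j, 1/2 - \varepsilon) - j \big|
&\le \sqrt{\mathbb{E}\left[ \big(2B(j, 1/2 - \varepsilon) - j\big)^2 \right]} \\
&= \sqrt{j(1 - 4\varepsilon^2) + 4j^2\varepsilon^2} \\
&\le 2j\varepsilon + \sqrt{j}.
\end{aligned}$$

Therefore, applying Jensen's inequality, we get

$$\sum_{j=0}^{k} \binom{k}{j} p^j (1 - p)^{k-j}\;\mathbb{E}\,\big| 2B(j, 1/2 - \varepsilon) - j \big| \le 2kp\varepsilon + \sqrt{kp}.$$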

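Summarizing what we have obtained so far, we have

$$\begin{aligned}
\sup_{x, \sigma, u} \left( \mathbb{E}\,\widehat{L}_n - \min_{b} L_{b,n} \right)
&\ge \sum_{t=1}^{n} \sum_{k=1}^{m} \mathbb{P}[K = k]\,(2\varepsilon)\,L^*(Z', \sigma_{X_t}) \\
&\ge \sum_{t=1}^{n} \sum_{k=1}^{m} \mathbb{P}[K = k]\,(2\varepsilon)\,\frac{1}{2}\,\alpha^{-2kp\varepsilon - \sqrt{kp}} \\
&\ge n\varepsilon\,\alpha^{-2m\varepsilon/\log_2 N - \sqrt{m/\log_2 N}} \qquad \text{(because } k \le m\text{)} \\
&\ge n\varepsilon\,e^{-(\alpha - 1)\left( 2m\varepsilon/\log_2 N + \sqrt{m/\log_2 N} \right)} \qquad \text{(using } \ln \alpha \le \alpha - 1\text{)} \\
&= n\varepsilon \exp\left( -\frac{8m\varepsilon^2}{(1 - 2\varepsilon)\log_2 N} - \frac{4\varepsilon}{1 - 2\varepsilon}\sqrt{\frac{m}{\log_2 N}} \right).
\end{aligned}$$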

We choose $\varepsilon = c\sqrt{\log_2 N/m}$, which is less than $1/4$ if $m > (4c)^2 \log_2 N$. With this choice
the lower bound is at least $nc\sqrt{\log_2 N/m}\;e^{-16c^2 - 8c}$. This is maximized by choosing
$c = (\sqrt{3} - 1)/8$, which leads to the announced result. □
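The last optimization step is easy to verify numerically. The sketch below, with hypothetical names, evaluates $c\,e^{-16c^2 - 8c}$ on a grid and compares the maximizer with the closed form $(\sqrt{3} - 1)/8$ obtained by setting the derivative to zero, that is, by solving $32c^2 + 8c - 1 = 0$.

```python
import math

def lower_bound_constant(c):
    """The factor c * exp(-16 c^2 - 8 c) multiplying n * sqrt(log2(N) / m)."""
    return c * math.exp(-16.0 * c * c - 8.0 * c)

# Closed-form maximizer: the positive root of 32 c^2 + 8 c - 1 = 0.
c_star = (math.sqrt(3.0) - 1.0) / 8.0

# Brute-force search on a fine grid of candidate values in (0, 1/4).
grid = [i / 100000.0 for i in range(1, 25000)]
c_grid = max(grid, key=lower_bound_constant)
```

At `c_star` the exponent $16c^2 + 8c$ equals $\sqrt{3}/2$, which is where the factor $e^{-\sqrt{3}/2}$ in Theorem 6.4 comes from.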


6.4 Partial Monitoring


In a wide and important class of prediction problems the forecaster does not know the 
losses he has suffered in the past and instead receives feedback often carrying quite lim- 
ited information. Such situations have often been called prediction problems with partial 
monitoring. 

Before describing the formal setup, consider the following dynamic pricing problem as a
nontrivial example of a partial-monitoring problem. A vendor sells a product to a sequence
of customers whom he attends one by one. To each customer, the seller offers the product
at a price he selects, say, from the set $\{1, \ldots, N\}$. The customer then decides whether to
buy the product. No bargaining is possible and no other information is exchanged between
buyer and seller. The goal of the seller is to achieve an income almost as large as if he
knew the maximal price each customer is willing to pay for the product. Thus, if the price
offered to the $t$th customer is $I_t$ and the highest price this customer is willing to pay is
$Y_t \in \{1, \ldots, N\}$, then the loss of the seller is

$$\frac{(Y_t - I_t)\,\mathbb{I}_{\{I_t \le Y_t\}} + c\,\mathbb{I}_{\{I_t > Y_t\}}}{N},$$

where $0 \le c \le N$. Thus, if the product is bought (i.e., $I_t \le Y_t$), then the loss of the seller
is proportional to the difference between the best possible and the offered prices, and if
the product is not bought, then the seller suffers a constant loss $c/N$. (The factor $1/N$ is
merely here to ensure that the loss is between 0 and 1.) In another version of the problem
the constant $c$ is replaced by $Y_t$. In either case, if the seller knew in advance the sequence
of $Y_t$'s then he could set a constant price $i \in \{1, \ldots, N\}$ that minimizes his overall loss. A
question we investigate in this section is whether there exists a randomized algorithm for
the seller such that his regret

$$\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t)$$

is guaranteed to be $o(n)$ regardless of the sequence $Y_1, Y_2, \ldots$ of highest customer prices.
The difficulty in this problem is that the only information the seller (i.e., the forecaster) has
access to is whether $I_t > Y_t$, but neither $Y_t$ nor $\ell(I_t, Y_t)$ is revealed. (Note that if $Y_t$ were
revealed to the forecaster in each step, then we would be back to the full-information case
studied in Section 4.2.)
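The pricing loss, the one-bit feedback, and the regret against the best constant price can be written out as a small sketch. The function names and argument order are illustrative choices, not notation from the text.

```python
def pricing_loss(price, valuation, c, n_prices):
    """Loss ((Y - I) 1{I <= Y} + c 1{I > Y}) / N of offering `price` to a customer
    whose highest acceptable price is `valuation`."""
    if price <= valuation:
        return (valuation - price) / n_prices  # sale made, but possibly too cheaply
    return c / n_prices                        # offer rejected

def pricing_feedback(price, valuation):
    """The only signal the seller observes: True when the offer was rejected (I > Y)."""
    return price > valuation

def pricing_regret(prices, valuations, c, n_prices):
    """Cumulative loss of the offered prices minus that of the best constant price."""
    seller = sum(pricing_loss(i, y, c, n_prices) for i, y in zip(prices, valuations))
    best_constant = min(
        sum(pricing_loss(i, y, c, n_prices) for y in valuations)
        for i in range(1, n_prices + 1)
    )
    return seller - best_constant
```

Note that `pricing_regret` needs the full sequence of valuations, which the seller never sees; it can only be evaluated in hindsight by an outside observer.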

We treat such limited-feedback (or partial monitoring) prediction problems in a 
more general framework, which we describe next. The dynamic pricing problem is a 
special case. 

In the prediction problem with partial monitoring, we assume that the outcome space
is finite with $M$ elements. Without loss of generality, we assume that $\mathcal{Y} = \{1, \ldots, M\}$.
Denote the matrix of losses by $L = [\ell(i, j)]_{N \times M}$. (This matrix is assumed to be known
by the forecaster.) If, at time $t$, the forecaster chooses an action $I_t \in \{1, \ldots, N\}$ and
the outcome is $Y_t \in \mathcal{Y}$, then the forecaster suffers loss $\ell(I_t, Y_t)$. However, instead of the
outcome $Y_t$, the forecaster only observes the feedback $h(I_t, Y_t)$, where $h$ is a known
feedback function that assigns, to each action/outcome pair in $\{1, \ldots, N\} \times \mathcal{Y}$, an element
of a finite set $\mathcal{S} = \{s_1, \ldots, s_m\}$ of signals. The values of $h$ are collected in a feedback matrix
$H = [h(i, j)]_{N \times M}$. Here we also allow a nonoblivious adversary; that is, the outcome $Y_t$
may depend on the past actions $I_1, \ldots, I_{t-1}$ of the forecaster.
The partial monitoring prediction game is described as follows.


PREDICTION WITH PARTIAL MONITORING


Parameters: finite outcome space $\mathcal{Y} = \{1, \ldots, M\}$, number of actions $N$, loss function
$\ell$, feedback function $h$.


For each round $t = 1, 2, \ldots$,


(1) the environment chooses the next outcome $Y_t \in \mathcal{Y}$ without revealing it;

(2) the forecaster chooses an action $I_t \in \{1, \ldots, N\}$;

(3) the forecaster incurs loss $\ell(I_t, Y_t)$ and each action $i$ incurs loss $\ell(i, Y_t)$, where
none of these values is revealed to the forecaster;

(4) the feedback $h(I_t, Y_t)$ is revealed to the forecaster.


We note here that some authors consider a more general setup in which the feedback
may be random. For the sake of clarity we stay with the simpler problem described above.
Exercise 6.11 points out that most results shown below may be extended, in a straightforward
way, to the case of random feedback.

It is an interesting and complex problem to investigate the possibilities of a predictor
supplied only with the limited information of the feedback. For example, one may ask under
what conditions it is possible to achieve Hannan consistency, that is, to guarantee that,
asymptotically, the cumulative loss of the predictor is not larger than that of the best
constant action, with probability 1. Naturally, this depends on the relationship between the
loss and feedback functions. Note that the predictor is free to encode the values $h(i, j)$
of the feedback function by real numbers. The only restriction is that if $h(i, j) = h(i, j')$,
then the corresponding real numbers should also coincide. To avoid ambiguities by trivial
rescaling, we assume that $|h(i, j)| \le 1$ for all pairs $(i, j)$. Thus, in the sequel we assume
that $H = [h(i, j)]_{N \times M}$ is a matrix of real numbers between $-1$ and $1$ and keep in mind
that the predictor may replace this matrix by $H_\phi = [\phi_i(h(i, j))]_{N \times M}$ for arbitrary functions
$\phi_i : [-1, 1] \to [-1, 1]$, $i = 1, \ldots, N$. Note that the set $\mathcal{S}$ of signals may be chosen such
that it has $m \le M$ elements, though after numerical encoding the matrix may have as many
as $MN$ distinct elements.

Before introducing a general prediction strategy, we describe a few concrete examples. 


Example 6.1 (Multi-armed bandit problem). In many prediction problems the forecaster,
after taking an action, is able to measure his loss (or reward) but does not have access
to what would have happened had he chosen another possible action. Such prediction
problems have been known as multi-armed bandit problems. The name refers to a gambler
who plays a pool of slot machines (called “one-armed bandits”). The gambler places his
bet each time on a possibly different slot machine and his goal is to win almost as much
as if he had known in advance which slot machine would return the maximal total reward.
The multi-armed bandit problem is a special case of the partial monitoring problem. Here
we may simply take $H = L$, that is, the feedback received by the forecaster is just his
own loss. This problem has been widely studied both in a stochastic and a worst-case
setting, and has several unique features, which we study separately in Sections 6.7, 6.8,
and 6.9.

Example 6.2 (Dynamic pricing). In the dynamic pricing problem described in the introduction
of the section we may take $M = N$ and the loss matrix is

$$L = [\ell(i, j)]_{N \times N} \qquad \text{where} \qquad \ell(i, j) = \frac{(j - i)\,\mathbb{I}_{\{i \le j\}} + c\,\mathbb{I}_{\{i > j\}}}{N}.$$

The information the forecaster (i.e., the vendor) receives is simply whether the predicted
value $I_t$ is greater than the outcome $Y_t$. Thus, the entries of the feedback matrix $H$ may be
taken to be $h(i, j) = \mathbb{I}_{\{i > j\}}$ or, after an appropriate re-encoding,

$$h(i, j) = a\,\mathbb{I}_{\{i \le j\}} + b\,\mathbb{I}_{\{i > j\}}, \qquad i, j = 1, \ldots, N,$$

where $a$ and $b$ are constants chosen by the forecaster satisfying $a, b \in [-1, 1]$.
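As a concrete illustration, the loss and feedback matrices of the dynamic pricing example can be built as nested lists. This is a sketch with a hypothetical helper name; the defaults $a = 1$, $b = 0$ are just one admissible encoding.

```python
def pricing_matrices(n_prices, c, a=1.0, b=0.0):
    """Return (L, H) for dynamic pricing with prices and valuations in {1, ..., N}.

    L[i-1][j-1] = ((j - i) 1{i <= j} + c 1{i > j}) / N
    H[i-1][j-1] = a 1{i <= j} + b 1{i > j}
    """
    indices = range(1, n_prices + 1)
    loss = [[((j - i) if i <= j else c) / n_prices for j in indices] for i in indices]
    feedback = [[a if i <= j else b for j in indices] for i in indices]
    return loss, feedback

L_example, H_example = pricing_matrices(3, c=1.0)
```

With $a = 1$ and $b = 0$ the matrix `H_example` is upper triangular with ones on and above the diagonal.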


Example 6.3 (Apple tasting). Consider the simple example in which $N = M = 2$ and the
loss and feedback matrices are given by

$$L = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \text{and} \qquad H = \begin{bmatrix} a & a \\ b & c \end{bmatrix}.$$

Thus, the predictor only receives feedback about the outcome $Y_t$ when he chooses the
second action. This example has been known as the apple tasting problem. (Imagine that 
apples are to be classified as “good for sale” or “rotten.” An apple classified as “rotten” 
may be opened to check whether its classification was correct. On the other hand, since 
apples that have been checked cannot be put on sale, an apple classified “good for sale” is 
never checked.) 


Example 6.4 (Label efficient prediction). A variant of the label efficient prediction problem
of Sections 6.2 and 6.3 may also be cast as a partial monitoring problem. Let $N = 3$, $M = 2$,
and consider loss and feedback matrices of the form

$$L = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad \text{and} \qquad H = \begin{bmatrix} a & b \\ c & c \\ c & c \end{bmatrix}.$$

In this example the only times useful feedback is received are when the first action is played, 
in which case a maximal loss is incurred regardless of the outcome. Thus, just as in the 
problem of label efficient prediction, playing the “informative” action has to be limited; 
otherwise there is no hope for Hannan consistency. 


Remark 6.2 (Compound actions and dynamic strategies). In setting up the problem, for
simplicity, we restricted our attention to forecasters whose aim is to predict as well as the
best constant action. This is reflected in the definition of regret in which the cumulative
loss of the forecaster is compared with $\min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t)$, the cumulative loss of the
best constant action. However, just as in the full information case described in Sections 4.2
and 5.2, one may define the regret in terms of a class of compound actions (or dynamic
strategies). Recall that a compound action is a sequence $\mathbf{i} = (i_1, \ldots, i_n)$ of actions
$i_t \in \{1, \ldots, N\}$. Given a class $S$ of compound actions, one may define the regret

$$\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{\mathbf{i} \in S} \sum_{t=1}^{n} \ell(i_t, Y_t).$$

The forecasters defined for the partial monitoring problem introduced below can be easily
generalized to handle this more general case; see, for example, Exercise 6.16.


6.5 A General Forecaster for Partial Monitoring 


We now present a general prediction strategy that works in the partial monitoring setting
provided the feedback matrix contains “enough information” about the loss matrix. The
forecaster is a modification of the exponentially weighted average forecaster (see Section 4.2).
The main modification is that the losses $\ell(i, Y_t)$ are now replaced by appropriate
estimates. However, an estimate with the desired properties can only be constructed under
some conditions on the loss and feedback matrices.

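The crucial assumption is that there exists a matrix $K = [k(i, j)]_{N \times N}$ such that $L = K\,H$,
that is, the matrices

$$H \qquad \text{and} \qquad \begin{bmatrix} H \\ L \end{bmatrix}$$

have the same rank. In other words, we may write, for all $i \in \{1, \ldots, N\}$ and
$j \in \{1, \ldots, M\}$,

$$\ell(i, j) = \sum_{l=1}^{N} k(i, l)\,h(l, j).$$

In this case one may define the estimated losses $\widetilde{\ell}$ by

$$\widetilde{\ell}(i, Y_t) = \frac{k(i, I_t)\,h(I_t, Y_t)}{p_{I_t, t}}, \qquad i = 1, \ldots, N,$$

and their sums as $\widetilde{L}_{i,n} = \widetilde{\ell}(i, Y_1) + \cdots + \widetilde{\ell}(i, Y_n)$. The forecaster for partial monitoring is
then defined as follows.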


A GENERAL FORECASTER FOR PARTIAL MONITORING


Parameters: matrix of losses $L$, feedback matrix $H$, matrix $K$ such that $L = K\,H$, real
numbers $0 < \eta, \gamma < 1$.


Initialization: $w_0 = (1, \ldots, 1)$.

For each round $t = 1, 2, \ldots$


(1) draw an action $I_t \in \{1, \ldots, N\}$ according to the distribution

$$p_{i,t} = (1 - \gamma)\,\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}, \qquad i = 1, \ldots, N;$$

(2) get feedback $h_t = h(I_t, Y_t)$ and compute $\widetilde{\ell}_{i,t} = k(i, I_t)\,h_t / p_{I_t,t}$ for all $i = 1, \ldots, N$;

(3) compute $w_{i,t} = w_{i,t-1}\,e^{-\eta\,\widetilde{\ell}(i, Y_t)}$ for all $i = 1, \ldots, N$.
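The three steps of the box can be sketched as follows. This is a minimal illustration under stated assumptions: outcomes are a fixed list of 0-based column indices (an oblivious opponent), $W_{t-1}$ is taken to be the sum of the weights, and the names `mixed_distribution` and `run_forecaster` as well as the apple tasting encoding $a = 1$, $b = 0$, $c = 1$ are choices made here for the example. The weights are rescaled after each update purely for numerical stability; this does not change the sampling distribution.

```python
import math
import random

def mixed_distribution(weights, gamma):
    """p_i = (1 - gamma) * w_i / W + gamma / N, with W the sum of the weights."""
    total = sum(weights)
    n_actions = len(weights)
    return [(1.0 - gamma) * w / total + gamma / n_actions for w in weights]

def run_forecaster(K, H, eta, gamma, outcomes, seed=0):
    """Play the partial monitoring game against a fixed outcome sequence.

    K is the N x N matrix with L = K H, H is the N x M feedback matrix, and
    `outcomes` holds 0-based column indices. Returns the actions played.
    """
    rng = random.Random(seed)
    n_actions = len(H)
    weights = [1.0] * n_actions
    actions = []
    for y in outcomes:
        probs = mixed_distribution(weights, gamma)                 # step (1)
        action = rng.choices(range(n_actions), weights=probs)[0]
        signal = H[action][y]          # the only quantity revealed, step (2)
        estimates = [K[i][action] * signal / probs[action] for i in range(n_actions)]
        weights = [w * math.exp(-eta * est) for w, est in zip(weights, estimates)]  # step (3)
        largest = max(weights)
        weights = [w / largest for w in weights]  # rescaling leaves probs unchanged
        actions.append(action)
    return actions

# Apple tasting (Example 6.3) with the illustrative encoding a = 1, b = 0, c = 1.
H = [[1.0, 1.0], [0.0, 1.0]]
K = [[1.0, -1.0], [0.0, 1.0]]   # K H equals the loss matrix [[1, 0], [0, 1]]
actions = run_forecaster(K, H, eta=0.05, gamma=0.1, outcomes=[1] * 2000)
```

In this run every outcome is the second one, for which the first action has loss 0 and the second has loss 1, so the forecaster should end up playing the first action (index 0) most of the time while still exploring at rate $\gamma/N$.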


As this forecaster is randomized, we carry out its analysis in the nonoblivious opponent
model and thus write the outcomes as random variables $Y_t$.
Denoting by $\mathbb{E}_t$ the conditional expectation given $I_1, \ldots, I_{t-1}$ (i.e., the expectation with
respect to the distribution $p_t$ of the random variable $I_t$), observe that

$$\mathbb{E}_t\,\widetilde{\ell}(i, Y_t) = \sum_{j=1}^{N} p_{j,t}\,\frac{k(i, j)\,h(j, Y_t)}{p_{j,t}} = \sum_{j=1}^{N} k(i, j)\,h(j, Y_t) = \ell(i, Y_t)$$

and therefore $\widetilde{\ell}(i, Y_t)$ is an unbiased estimate of the loss $\ell(i, Y_t)$.
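Both the factorization $L = K\,H$ and the unbiasedness of the estimate can be checked exactly on a small instance. The sketch below uses apple tasting with the illustrative encoding $a = 1$, $b = 0$, $c = 1$ and an arbitrary sampling distribution; the helper names are hypothetical and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def expected_estimate(K, H, probs, i, y):
    """E_t of k(i, I_t) h(I_t, y) / p_{I_t} when I_t is drawn from `probs`."""
    return sum(p * (K[i][j] * H[j][y] / p) for j, p in enumerate(probs))

L_apple = [[1, 0], [0, 1]]
H_apple = [[1, 1], [0, 1]]
K_apple = [[1, -1], [0, 1]]
probs = [Fraction(1, 4), Fraction(3, 4)]   # any distribution with positive entries works
```

The division by `p` inside the sum cancels the sampling probability, which is why the expectation does not depend on the particular distribution used.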

The next result bounds the expected regret of the forecaster defined above. It serves as a
first step of a more useful result, Theorem 6.6, which states that a similar bound also applies
for the actual (random) regret with overwhelming probability. In this result the per-round
regret $\frac{1}{n}\big(\widehat{L}_n - \min_{i=1,\ldots,N} L_{i,n}\big)$ decreases to 0 at a rate $n^{-1/3}$. This is significantly slower
than the best rate $n^{-1/2}$ obtained in the “full information” case. In Section 6.6 we show that
this rate cannot be improved in general. Thus, the price paid for having access only to some
feedback rather than the actual outcomes is a deterioration in the rate of convergence.
However, Hannan consistency is still achievable whenever the conditions of the theorem
are satisfied. (See also Theorem 6.6 for a significantly more precise statement.)


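Theorem 6.5. Consider a partial monitoring game $(L, H)$ such that the loss and feedback
matrices satisfy $L = K\,H$ for some $N \times N$ matrix $K$, with $k^* = \max\{1, \max_{i,j} |k(i, j)|\}$.
Then the forecaster for partial monitoring, run with parameters

$$\eta = \frac{1}{C}\left( \frac{\ln N}{Nn} \right)^{2/3} \qquad \text{and} \qquad \gamma = C\left( \frac{N^2 \ln N}{n} \right)^{1/3},$$

where $C = (k^*\sqrt{e - 2})^{2/3}$, satisfies

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(I_t, Y_t) \right] - \min_{i=1,\ldots,N} \mathbb{E}\left[ \sum_{t=1}^{n} \ell(i, Y_t) \right] \le 3\,(k^*\sqrt{e - 2})^{2/3}\,(Nn)^{2/3}\,(\ln N)^{1/3}$$

for all $n \ge (\ln N)/(N k^* \sqrt{e - 2})$.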


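Proof. The first steps of the proof are based on a simple modification of the standard
argument of Theorem 2.2. On the one hand, for any $j = 1, \ldots, N$ we have

$$\ln\frac{W_n}{W_0} = \ln\left( \sum_{i=1}^{N} e^{-\eta \widetilde{L}_{i,n}} \right) - \ln N \ge -\eta\,\widetilde{L}_{j,n} - \ln N.$$

On the other hand, for each $t = 1, \ldots, n$,

$$\begin{aligned}
\ln\frac{W_t}{W_{t-1}}
&= \ln \sum_{i=1}^{N} \frac{w_{i,t-1}}{W_{t-1}}\,e^{-\eta\,\widetilde{\ell}(i, Y_t)} \\
&= \ln \sum_{i=1}^{N} \frac{p_{i,t} - \gamma/N}{1 - \gamma}\,e^{-\eta\,\widetilde{\ell}(i, Y_t)} \\
&\le \ln \sum_{i=1}^{N} \frac{p_{i,t} - \gamma/N}{1 - \gamma}\left( 1 - \eta\,\widetilde{\ell}(i, Y_t) + (e - 2)\,\eta^2\,\widetilde{\ell}(i, Y_t)^2 \right) \\
&\qquad \text{(since } e^x \le 1 + x + (e - 2)x^2 \text{ whenever } x \le 1\text{)} \\
&\le \ln\left( 1 - \frac{\eta}{1 - \gamma} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)\left( p_{i,t} - \frac{\gamma}{N} \right) + \frac{(e - 2)\,\eta^2}{1 - \gamma} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} \right) \\
&\le -\frac{\eta}{1 - \gamma} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)\left( p_{i,t} - \frac{\gamma}{N} \right) + \frac{(e - 2)\,\eta^2}{1 - \gamma} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t}.
\end{aligned}$$

Summing the inequality over $t = 1, \ldots, n$ and comparing the obtained upper and lower
bounds for $\ln(W_n/W_0)$, we have

$$\begin{aligned}
\sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)\,p_{i,t} - (1 - \gamma)\,\widetilde{L}_{j,n}
&\le \frac{(1 - \gamma) \ln N}{\eta} + \eta(e - 2) \sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} + \sum_{t=1}^{n} \sum_{i=1}^{N} \frac{\gamma}{N}\,\widetilde{\ell}(i, Y_t) \\
&\le \frac{\ln N}{\eta} + \eta(e - 2) \sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} + \frac{\gamma}{N} \sum_{i=1}^{N} \widetilde{L}_{i,n}. \qquad (6.2)
\end{aligned}$$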


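Note that we used the fact that $|\eta\,\widetilde{\ell}(i, Y_t)| \le 1$ for all $t$ and $i$. This condition is satisfied
whenever $N k^* \eta/\gamma \le 1$, which holds for our choice of $\eta$, $\gamma$, and $n$. Note that the assumption
on $n$ guarantees that $\eta \le 1$. Furthermore, since our choice of $\gamma$ implies that for $\gamma > 1$ the
bound stated in the theorem holds trivially, we also assume $\gamma \le 1$.

Now, recalling that $\mathbb{E}_t\,\widetilde{\ell}(i, Y_t) = \ell(i, Y_t)$, we get, after some trivial upper bounding,

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(I_t, Y_t) \right] - \min_{j=1,\ldots,N} \mathbb{E}\,L_{j,n} \le \frac{\ln N}{\eta} + \gamma n + \eta(e - 2) \sum_{t=1}^{n} \sum_{i=1}^{N} \mathbb{E}\left[ p_{i,t}\,\widetilde{\ell}(i, Y_t)^2 \right],$$

where we used $L_{j,n} = \sum_{t=1}^{n} \ell(j, Y_t) \le n$.
The proof is concluded by handling the squared terms as follows:

$$\sum_{i=1}^{N} p_{i,t}\,\mathbb{E}_t\,\widetilde{\ell}(i, Y_t)^2 = \sum_{i=1}^{N} p_{i,t} \sum_{j=1}^{N} p_{j,t}\,\frac{k(i, j)^2\,h(j, Y_t)^2}{p_{j,t}^2} \le \frac{N^2 (k^*)^2}{\gamma}. \qquad (6.3)$$

Our choice of $\eta$ and $\gamma$ finally yields the statement. □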
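The final balancing step can be checked numerically. The sketch below assumes the tuning $\eta = C^{-1}(\ln N/(Nn))^{2/3}$ and $\gamma = C\,(N^2 \ln N/n)^{1/3}$ with $C = (k^*\sqrt{e-2})^{2/3}$, plugs it into the bound $\ln N/\eta + \gamma n + \eta(e-2)\,n\,(Nk^*)^2/\gamma$ obtained by combining the last two displays, and compares the result with the closed form of Theorem 6.5. The function names are hypothetical.

```python
import math

def tuned_parameters(n, n_actions, k_star):
    """eta and gamma as functions of the horizon n, the number of actions N, and k*."""
    C = (k_star * math.sqrt(math.e - 2.0)) ** (2.0 / 3.0)
    eta = (math.log(n_actions) / (n_actions * n)) ** (2.0 / 3.0) / C
    gamma = C * (n_actions ** 2 * math.log(n_actions) / n) ** (1.0 / 3.0)
    return eta, gamma

def expected_regret_bound(n, n_actions, k_star, eta, gamma):
    """ln N / eta + gamma n + eta (e - 2) n (N k*)^2 / gamma."""
    return (math.log(n_actions) / eta
            + gamma * n
            + eta * (math.e - 2.0) * n * (n_actions * k_star) ** 2 / gamma)

def closed_form_bound(n, n_actions, k_star):
    """3 (k* sqrt(e - 2))^(2/3) (N n)^(2/3) (ln N)^(1/3)."""
    C = (k_star * math.sqrt(math.e - 2.0)) ** (2.0 / 3.0)
    return 3.0 * C * (n_actions * n) ** (2.0 / 3.0) * math.log(n_actions) ** (1.0 / 3.0)
```

With this tuning the three terms of the bound are equal, each contributing one third of the closed form.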


As promised, the next result shows that an upper bound of the order $n^{2/3}$ holds not
only for the expected regret but also with a large probability. The extension is not at all
immediate, because it requires a careful analysis of the estimated losses. This is done by
appropriate tail estimates for sums of martingale differences. The main result of this section
is the following theorem.


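Theorem 6.6. Consider a partial monitoring game $(L, H)$ such that the loss and feedback
matrices satisfy $L = K\,H$ for some $N \times N$ matrix $K$, with $k^* = \max\{1, \max_{i,j} |k(i, j)|\}$.
Let $\delta \in (0, 1)$. Then the forecaster for partial monitoring, run with parameters

$$\eta = \left( \frac{\ln N}{2Nnk^*} \right)^{2/3} \qquad \text{and} \qquad \gamma = \left( \frac{(k^* N)^2 \ln N}{4n} \right)^{1/3},$$

satisfies, for all

$$n \ge \frac{1}{k^* N}\left( \ln\frac{N + 3}{\delta} \right)^{3/2}$$

and with probability at least $1 - \delta$,

$$\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t) \le 10\,(Nnk^*)^{2/3}\,(\ln N)^{1/3}\sqrt{\ln\frac{N + 3}{\delta}}.$$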

The starting point of the proof is inequality (6.2). We show that, with an overwhelming
probability, the right-hand side is less than something of the order $n^{2/3}$ and that the left-hand
side is close to the actual regret

$$\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{j=1,\ldots,N} L_{j,n}.$$

Our main tool is the martingale inequality in Lemma A.8. This inequality implies the
following two lemmas. Recall the notation

$$\ell(p_t, Y_t) = \sum_{i=1}^{N} p_{i,t}\,\ell(i, Y_t) \qquad \text{and} \qquad \widetilde{\ell}(p_t, Y_t) = \sum_{i=1}^{N} p_{i,t}\,\widetilde{\ell}(i, Y_t).$$
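Lemma 6.3. With probability at least $1 - \delta/(N + 3)$,

$$\sum_{t=1}^{n} \ell(p_t, Y_t) \le \sum_{t=1}^{n} \widetilde{\ell}(p_t, Y_t) + 2k^* N \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}}.$$

Proof. Define $X_t = \ell(p_t, Y_t) - \widetilde{\ell}(p_t, Y_t)$. Clearly, $\mathbb{E}_t X_t = 0$ and the $X_t$'s form a martingale
difference sequence with respect to $I_1, I_2, \ldots$. Since

$$\begin{aligned}
\mathbb{E}_t\left[ X_t^2 \right]
&\le \sum_{i,j} p_{i,t}\,p_{j,t}\,\mathbb{E}_t\left[ \widetilde{\ell}(i, Y_t)\,\widetilde{\ell}(j, Y_t) \right] \\
&= \sum_{i,j} p_{i,t}\,p_{j,t} \sum_{m=1}^{N} p_{m,t}\,\frac{k(i, m)\,k(j, m)\,h(m, Y_t)^2}{p_{m,t}^2},
\end{aligned}$$

we have $\Sigma_n^2 = \sum_{t=1}^{n} \mathbb{E}_t\left[ X_t^2 \right] \le n (k^*)^2 N^2/\gamma$. Thus, the statement follows directly by
Lemma A.8 and the assumption on $n$. □

Lemma 6.4. For each fixed $j$, with probability at least $1 - \delta/(N + 3)$,

$$\left| \widetilde{L}_{j,n} - L_{j,n} \right| \le 2k^* N \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}}.$$

Proof. Just as in the previous lemma, use Lemma A.8 for the martingale difference
sequence $\widetilde{\ell}(j, Y_t) - \ell(j, Y_t)$ to obtain the stated inequality. □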


The next two lemmas are easy consequences of the Hoeffding-Azuma inequality. In
particular, the following result is an immediate corollary of Lemma A.7.


Lemma 6.5. With probability at least $1 - \delta/(N + 3)$,

$$\sum_{t=1}^{n} \ell(I_t, Y_t) \le \sum_{t=1}^{n} \ell(p_t, Y_t) + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}}.$$


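Lemma 6.6. With probability at least $1 - \delta/(N + 3)$,

$$\sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} \le n\,\frac{(Nk^*)^2}{\gamma} + \frac{(Nk^*)^2}{\gamma^2} \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}}.$$

Proof. Let $X_t = \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t}$. Since $0 \le X_t \le (Nk^*/\gamma)^2$, Lemma A.7 implies that

$$\mathbb{P}\left[ \sum_{t=1}^{n} V_t > \left( \frac{Nk^*}{\gamma} \right)^2 \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} \right] \le \frac{\delta}{N + 3},$$

where $V_t = X_t - \mathbb{E}_t X_t$. The proof is now concluded by using $\mathbb{E}_t X_t \le (Nk^*)^2/\gamma$, as already
proved in (6.3). □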


The proof of the main result is now an easy combination of these lemmas. 


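Proof of Theorem 6.6. The condition $N k^* \eta/\gamma \le 1$ is satisfied with the proposed choices
of $\eta$ and $\gamma$, thus (6.2) is still valid. By Lemmas 6.3, 6.4, 6.5, and 6.6, combined with the
union-of-events bound, with probability at least $1 - \delta$ we have, simultaneously,

$$\sum_{t=1}^{n} \ell(p_t, Y_t) \le \sum_{t=1}^{n} \widetilde{\ell}(p_t, Y_t) + 2k^* N \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}}, \qquad (6.4)$$

$$\left| \widetilde{L}_{j,n} - L_{j,n} \right| \le 2k^* N \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}} \qquad \text{for all } j = 1, \ldots, N, \qquad (6.5)$$

$$\sum_{t=1}^{n} \ell(I_t, Y_t) \le \sum_{t=1}^{n} \ell(p_t, Y_t) + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}}, \qquad (6.6)$$

and

$$\sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} \le n\,\frac{(Nk^*)^2}{\gamma} + \frac{(Nk^*)^2}{\gamma^2} \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}}. \qquad (6.7)$$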


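On this event, we can bound the regret as follows:

$$\begin{aligned}
&\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{j=1,\ldots,N} L_{j,n} \\
&\quad \le \sum_{t=1}^{n} \ell(p_t, Y_t) - \min_{j=1,\ldots,N} L_{j,n} + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} \qquad \text{(by (6.6))} \\
&\quad \le \sum_{t=1}^{n} \widetilde{\ell}(p_t, Y_t) - \min_{j=1,\ldots,N} \widetilde{L}_{j,n} + 4k^* N \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}} + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} \qquad \text{(by (6.4) and (6.5))} \\
&\quad \le \frac{\ln N}{\eta} + \eta(e - 2) \sum_{t=1}^{n} \sum_{i=1}^{N} \widetilde{\ell}(i, Y_t)^2\,p_{i,t} + n\gamma + 4k^* N (1 + \gamma) \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}} \\
&\qquad\quad + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} \qquad \text{(by (6.2) and (6.5))} \\
&\quad \le \frac{\ln N}{\eta} + \eta(e - 2)\,n\,\frac{(Nk^*)^2}{\gamma} + \eta(e - 2)\,\frac{(Nk^*)^2}{\gamma^2} \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} + n\gamma \\
&\qquad\quad + 4k^* N (1 + \gamma) \sqrt{\frac{n}{\gamma} \ln\frac{N + 3}{\delta}} + \sqrt{\frac{n}{2} \ln\frac{N + 3}{\delta}} \qquad \text{(by (6.7))}.
\end{aligned}$$

Resubstituting the proposed values of $\gamma$ and $\eta$, and overapproximating, gives the claimed
result. □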


Theorem 6.6 guarantees the existence of a forecasting strategy with a regret of order
$n^{2/3}$ whenever the rank of the feedback matrix $H$ is not smaller than the rank of the
stacked matrix $\begin{pmatrix} H \\ L \end{pmatrix}$. (Note that Hannan consistency may be achieved under the same condition;
see Exercise 6.5.) This condition is satisfied for a wide class of problems. As an illustration,
we consider the examples described in Section 6.4.


Example 6.5 (Multi-armed bandit problem). Recall that in the case of the multi-armed 
bandit problem the condition of Theorem 6.6 is trivially satisfied because H = L. Indeed, 
one may take K to be the identity matrix so that k* = 1. In a subsequent section we show 
that in this case significantly smaller regrets may be achieved than those guaranteed by 
Theorem 6.6 by a carefully designed prediction algorithm. 
Example 6.6 (Dynamic pricing). In the dynamic pricing problem, described in the
introduction of this chapter, the feedback matrix is given by $h(i, j) = a\,\mathbb{I}_{\{i \le j\}} + b\,\mathbb{I}_{\{i > j\}}$ for some
arbitrarily chosen values of $a$ and $b$. By choosing, for example, $a = 1$ and $b = 0$, it is clear
that $H$ is an invertible matrix and therefore one may choose $K = L \cdot H^{-1}$ and obtain a
Hannan-consistent strategy with regret of order $n^{2/3}$. Thus, the seller has a way of selecting
the prices $I_t$ such that his loss is not much larger than what could have been achieved had
he known the values $Y_t$ of all customers and offered the best constant price. Note that with
this choice of $a$ and $b$ the value of $k^*$ equals 1 (i.e., does not depend on $N$). Therefore the
upper bound for the regret in terms of $n$ and $N$ is of the order $(nN \log N)^{2/3}$. Note also
that the defining expression of $K$ does not assume any special form for the loss matrix $L$.
Hence, whenever $H$ is invertible, Hannan consistency is achieved no matter how the loss
matrix is chosen.
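The encoding $K = L \cdot H^{-1}$ can be made concrete with a small sketch. It assumes $a = 1$, $b = 0$, a hypothetical $3 \times 3$ loss matrix with entries in $[0, 1]$, and that $k^*$ is the larger of 1 and the largest absolute entry of $K$; the helper names are illustrative only. For this choice of $a$ and $b$ the matrix $H$ is upper triangular with unit entries, so multiplying by its inverse amounts to taking differences of adjacent loss columns.

```python
from fractions import Fraction

def feedback_matrix(n_prices, a=1, b=0):
    """Feedback h(i, j) = a if i <= j else b (0-based indices)."""
    return [[a if i <= j else b for j in range(n_prices)] for i in range(n_prices)]

def encode_losses(L):
    """K = L H^{-1} for a = 1, b = 0: differences of adjacent loss columns."""
    return [[row[j] - (row[j - 1] if j > 0 else 0) for j in range(len(row))]
            for row in L]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical loss matrix, chosen only for illustration.
L = [[Fraction(0), Fraction(1, 3), Fraction(2, 3)],
     [Fraction(1, 2), Fraction(0), Fraction(1, 3)],
     [Fraction(1, 2), Fraction(1, 2), Fraction(0)]]
H = feedback_matrix(3)
K = encode_losses(L)
# Assumed definition of k*: max(1, largest absolute entry of K).
k_star = max(1, max(abs(x) for row in K for x in row))
```

Because each entry of $K$ is a difference of two numbers in $[0, 1]$, its absolute value never exceeds 1, which is consistent with the remark that $k^*$ does not depend on $N$.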


Example 6.7 (Apple tasting). In the apple tasting problem described earlier one may choose
the feedback values $a = b = 1$ and $c = 0$. This makes the feedback matrix invertible and,
once again, Theorem 6.6 applies.


Example 6.8 (Label efficient prediction). Recall next the variant of the label efficient
prediction problem in which the loss and feedback matrices are

\[
L = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}
\quad \text{and} \quad
H = \begin{pmatrix} a & b \\ c & c \\ c & c \end{pmatrix}.
\]

Here the rank of $L$ equals 2, and so it suffices to encode the feedback matrix such that its
rank equals 2. One possibility is to choose $a = 1/2$, $b = 1$, and $c = 1/4$. Then we have
$L = KH$ for

\[
K = \begin{pmatrix} 0 & 2 & 2 \\ 2 & -2 & -2 \\ -2 & 4 & 4 \end{pmatrix}.
\]
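The factorization $L = KH$ is easy to confirm by direct multiplication. The sketch below uses exact rational arithmetic, with the third row of $K$ taken to be $(-2, 4, 4)$, which is the row that solves the corresponding linear equations for $a = 1/2$, $b = 1$, $c = 1/4$; the helper `matmul` is illustrative.

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

a, b, c = Fraction(1, 2), Fraction(1), Fraction(1, 4)
H = [[a, b], [c, c], [c, c]]                      # encoded feedback matrix
K = [[0, 2, 2], [2, -2, -2], [-2, 4, 4]]          # candidate encoding matrix
L = [[1, 1], [0, 1], [1, 0]]                      # loss matrix of the example
product = matmul(K, H)
```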


This example of label efficient prediction reveals that the bound of Theorems 6.5 and 6.6
cannot be improved significantly in general. In particular, there exist partial monitoring
games for which the conditions of the theorems hold, yet the regret is $\Omega(n^{2/3})$. This follows
by a slight modification of the proof of Theorem 6.4 by noting that the lower bound
obtained in the setting of label efficient prediction holds in the earlier example in the partial
monitoring problem. More precisely, we have the following.


Theorem 6.7. For any label efficient forecaster there exists a sequence of outcomes
$y_1, y_2, \ldots$ such that, for all large enough $n$, the forecaster's expected regret satisfies

\[
\mathbb{E}\left[ \sum_{t=1}^{n} \ell(I_t, y_t) - \min_{i=1,\ldots,N} L_{i,n} \right] \ge c\, n^{2/3},
\]

where $c$ is a universal constant.

The proof is left to the reader as an exercise.


Remark 6.3. Even though it is not possible to improve the dependence in $n$ of the upper
bounds of Theorems 6.5 and 6.6 in general, in some special cases a rate of $O(n^{-1/2})$
is achievable. A main example of this situation is the multi-armed bandit problem (see
Section 6.8). Another instance is described in Exercise 6.7. Note also that the bound of
Theorem 6.6 does not depend explicitly on the value of the cardinality $M$ of the set of
outcomes, but a dependence on $M$ may be hidden in $k^*$. However, in some important
special cases, such as the multi-armed bandit problem, this value is independent of $M$. In
such cases the result extends easily to infinite sets $\mathcal{Y}$ of outcomes. In particular, the case
when the loss matrix changes with time can be treated this way.


6.6 Hannan Consistency and Partial Monitoring 


In this section we discuss conditions on the loss and feedback matrices under which
it is possible to construct Hannan-consistent forecasting strategies. Theorem 6.6 (more
precisely, Exercise 6.5) states that whenever there is an encoding of the values of the
feedback such that the rank of the loss-feedback matrix $\begin{pmatrix} H \\ L \end{pmatrix}$ does not exceed that of the
feedback matrix $H$, then Hannan consistency is achievable.

This condition is basically sufficient and necessary for obtaining Hannan consistency 
if either N = 2 (i.e., the predictor has two actions to choose from) or M = 2 (i.e., the 
outcomes are binary). When N = 2 or M = 2, then, apart from trivial cases, Hannan 
consistency is impossible to achieve if L cannot be written as K H for some encoding of 
the feedback matrix H and for some matrix K (for the precise statements, see Exercises 6.8 
and 6.9). 

In general, however, it is not true that the existence of a Hannan consistent predictor is 
guaranteed if and only if the loss matrix L can be expressed as K H. To see this, consider 
the following simple example. 


Example 6.9. Let $N = M = 3$ and

\[
L = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\quad \text{and} \quad
H = \begin{pmatrix} a & b & c \\ d & d & d \\ e & e & e \end{pmatrix}.
\]

Clearly, for all choices of the numbers $a, b, c, d, e$, the rank of the feedback matrix is at
most 2 and therefore there is no matrix $K$ for which $L = KH$. However, note that whenever
the first action is played, the forecaster has full information about the outcome $Y_t$
(provided that $a$, $b$, and $c$ are distinct). Formally,
an action $i \in \{1, \ldots, N\}$ is said to be revealing for a feedback matrix $H$ if all entries in the
$i$th row of $H$ are different. We now prove the existence of a Hannan-consistent forecaster
for all problems in which there exists a revealing action.
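The definition of a revealing action translates directly into a one-line check. The following sketch (the function name and the symbolic string entries are illustrative) lists the rows of a feedback matrix whose entries are pairwise distinct, using 0-based indices.

```python
def revealing_actions(H):
    """Return the 0-based indices of rows of H whose entries are all different."""
    return [i for i, row in enumerate(H) if len(set(row)) == len(row)]

# Feedback matrix of Example 6.9 with distinct symbols a, b, c in the first row.
H = [["a", "b", "c"],
     ["d", "d", "d"],
     ["e", "e", "e"]]
found = revealing_actions(H)
```

Only the first action is reported, in agreement with the example.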


We start by introducing our forecasting strategy for partial monitoring games with 
revealing actions. 
A FORECASTER FOR REVEALING ACTIONS
Parameters: $0 \le \varepsilon \le 1$ and $\eta > 0$. Action $r$ is revealing.
Initialization: $\mathbf{w}_0 = (1, \ldots, 1)$.
For each round $t = 1, 2, \ldots$

(1) draw an action $J_t$ from $\{1, \ldots, N\}$ according to the distribution
    $p_{i,t} = w_{i,t-1}/(w_{1,t-1} + \cdots + w_{N,t-1})$ for $i = 1, \ldots, N$;

(2) draw a Bernoulli random variable $Z_t$ such that $\mathbb{P}[Z_t = 1] = \varepsilon$;

(3) if $Z_t = 1$, then play the revealing action, $I_t = r$, observe $Y_t$, and compute
    $w_{i,t} = w_{i,t-1}\, e^{-\eta\, \ell(i, Y_t)/\varepsilon}$ for each $i = 1, \ldots, N$;

(4) otherwise, play $I_t = J_t$ and let $w_{i,t} = w_{i,t-1}$ for each $i = 1, \ldots, N$.
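The four steps of the forecaster can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the representation of the loss as a table `loss[i][y]`, and the convention that the outcome is simply handed to the forecaster when the revealing action `r` is played are all assumptions made for the example.

```python
import math
import random

def revealing_forecaster(loss, outcomes, r, eps, eta, rng):
    """Run the forecaster on a fixed outcome sequence.

    loss[i][y] is the loss of action i on outcome y (values in [0, 1]);
    r is the index of a revealing action. Returns (actions, final weights).
    """
    n_actions = len(loss)
    w = [1.0] * n_actions
    actions = []
    for y in outcomes:
        total = sum(w)
        p = [wi / total for wi in w]
        j = rng.choices(range(n_actions), weights=p)[0]   # step (1)
        z = 1 if rng.random() < eps else 0                 # step (2)
        if z == 1:
            # Step (3): play r, observe y, update all weights with
            # the importance-weighted losses loss[i][y] / eps.
            actions.append(r)
            w = [wi * math.exp(-eta * loss[i][y] / eps) for i, wi in enumerate(w)]
        else:
            # Step (4): play the drawn action, leave the weights unchanged.
            actions.append(j)
    return actions, w

# Hypothetical 3-action, 2-outcome loss table.
loss = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
actions, weights = revealing_forecaster(
    loss, [0, 1, 0, 0], r=0, eps=0.5, eta=0.3, rng=random.Random(0))
```

Setting `eps=1.0` makes the forecaster query the revealing action in every round, while `eps=0.0` never updates the weights; intermediate values trade exploration cost against information.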


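Theorem 6.8. Consider a partial monitoring game $(L, H)$ such that $L$ has a revealing
action. Let $\delta \in (0, 1)$. If the forecaster for revealing actions is run with parameters

\[
\varepsilon = \max\left\{ 0,\; \frac{m - \sqrt{2m \ln(4/\delta)}}{n} \right\}
\quad \text{and} \quad
\eta = \sqrt{\frac{2\varepsilon \ln N}{n}},
\]

where $m = (4n)^{2/3} (\ln(4N/\delta))^{1/3}$, then

\[
\frac{1}{n} \left( \sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} L_{i,n} \right)
 \le 8\, n^{-1/3} \left( \ln \frac{4N}{\delta} \right)^{1/3}
\]

holds with probability at least $1 - \delta$.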
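To get a feel for the magnitudes involved, the parameter choices of Theorem 6.8 can be evaluated numerically. The sketch below is only an illustration: the function name and the sample values $n = 10^6$, $N = 10$, $\delta = 0.05$ are assumptions, and the formulas are those of the theorem, with $m$ rounds spent on the revealing action plus a term $8n\sqrt{\ln(4N/\delta)/m}$ for the remaining rounds.

```python
import math

def theorem_6_8_parameters(n, n_actions, delta):
    """Return (m, eps, eta, per_round_bound) as prescribed by Theorem 6.8."""
    log_term = math.log(4 * n_actions / delta)
    m = (4 * n) ** (2 / 3) * log_term ** (1 / 3)
    eps = max(0.0, (m - math.sqrt(2 * m * math.log(4 / delta))) / n)
    eta = math.sqrt(2 * eps * math.log(n_actions) / n)
    per_round_bound = 8 * n ** (-1 / 3) * log_term ** (1 / 3)
    return m, eps, eta, per_round_bound

n, n_actions, delta = 10**6, 10, 0.05
m, eps, eta, bound = theorem_6_8_parameters(n, n_actions, delta)
# Total regret split used in the proof: exploration rounds plus the rest.
split = m + 8 * n * math.sqrt(math.log(4 * n_actions / delta) / m)
```

With these values the query probability `eps` is a few percent, so the revealing action is played only rarely.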


Proof. The proof is a straightforward adaptation of the proof of Theorem 6.2 for the label
efficient forecaster. In particular, the forecaster for revealing actions essentially coincides
with the label efficient forecaster in Section 6.2. Indeed, Theorem 6.2 ensures that, with
probability at least $1 - \delta$, not more than $m$ among the $Z_t$ have value 1 and that the regret
accumulated over those rounds with $Z_t = 0$ is bounded by $8n \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}}$. Since each time a
revealing action is chosen the loss suffered is at most 1, this in turn implies that

\[
\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} L_{i,n} \le m + 8n \sqrt{\frac{1}{m} \ln \frac{4N}{\delta}}.
\]

Substituting the proposed value for the parameter $m$ concludes the proof. □


Remark 6.4 (Dependence on the number of actions). Observe that, even when the
condition of Theorem 6.6 is satisfied, the bound of Theorem 6.8 is considerably tighter. Indeed,
even though the dependence on the time horizon $n$ is identical in both bounds (of order
$n^{-1/3}$), the bound of Theorem 6.8 depends on the number of actions $N$ in a logarithmic way
only. As an example, consider the case of the multi-armed bandit problem. Recall that here
$H = L$ and there is a revealing action if and only if the loss matrix has a row whose elements
are all different. In such a case Theorem 6.8 provides a bound of order $((\ln N)/n)^{1/3}$. On
the other hand, there exist bandit problems for which, if $N \le n$, it is impossible to achieve
a per-round regret smaller than a constant multiple of $\sqrt{N/n}$ (see Theorem 6.11). If $N$ is large, the logarithmic
dependence of Theorem 6.8 thus gives a considerable advantage.


Interestingly, even if $L$ cannot be expressed as $KH$, the existence of a revealing action
ensures that the general forecaster of Section 6.5 may be used to achieve a small regret.
This may be done by first converting the problem into another partial monitoring problem
for which the general forecaster can be used. The basic step of this conversion is to replace
the pair $(L, H)$ of $N \times M$ matrices by a pair $(L', H')$ of $mN \times M$ matrices where $m \le M$
denotes the cardinality of the set $S = \{s_1, \ldots, s_m\}$ of signals (i.e., the number of distinct
elements of the matrix $H$). In the obtained prediction problem the forecaster chooses
among $mN$ actions at each time instance. The converted loss matrix $L'$ is obtained simply
by repeating each row of the original loss matrix $m$ times. The new feedback matrix $H'$ is
binary and is defined by

\[
H'(m(i-1) + k,\, j) = \mathbb{I}_{\{h(i,j) = s_k\}}, \qquad i = 1, \ldots, N, \quad k = 1, \ldots, m, \quad j = 1, \ldots, M.
\]

Note that this way we get rid of the inconvenient problem of how to encode, in a natural
way, the feedback symbols. If the matrices

\[
H' \quad \text{and} \quad \begin{pmatrix} H' \\ L' \end{pmatrix}
\]

have the same rank, then there exists a matrix $K'$ such that $L' = K'H'$ and the forecaster of
Section 6.5 may be applied to obtain a forecaster that has an average regret of order $n^{-1/3}$
for the converted problem. However, it is easy to see that any forecaster $A$ with such a
bounded regret for the converted problem may be trivially transformed into a forecaster $A'$
for the original problem with the same regret bound: $A'$ simply takes an action $i$ whenever
$A$ takes an action of the form $m(i-1) + k$ for any $k = 1, \ldots, m$.
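The conversion and the rank test can be sketched as follows. This is an illustrative sketch under stated assumptions: signals are taken to be the sorted distinct entries of $H$, rows are 0-based (so row $m \cdot i + k$ of the converted matrices corresponds to action $i$ and signal $s_k$), and the rank is computed by exact Gaussian elimination over the rationals. The sample feedback matrix encodes Example 6.9 with the hypothetical values $a = 0$, $b = 1$, $c = 2$, $d = e = 0$.

```python
from fractions import Fraction

def convert(L, H):
    """Return (L', H'): each row of L repeated once per signal, and the
    binary indicator rows H'[m*i + k][j] = 1 if H[i][j] equals signal k."""
    signals = sorted({h for row in H for h in row})
    L2, H2 = [], []
    for i, row in enumerate(H):
        for s in signals:
            L2.append(list(L[i]))
            H2.append([1 if h == s else 0 for h in row])
    return L2, H2

def rank(matrix):
    """Rank of a matrix via exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

L = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
H = [[0, 1, 2], [0, 0, 0], [0, 0, 0]]   # first action is revealing
L2, H2 = convert(L, H)
same_rank = rank(H2) == rank(H2 + L2)    # H2 + L2 stacks the two matrices
```

Here the encoded $H$ itself has rank 1, so $L = KH$ is impossible, yet the converted $H'$ has full column rank and the rank condition holds.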

The conversion procedure guarantees Hannan consistency for a large class of partial
monitoring problems. For example, if the original problem has a revealing action $i$, then
$m = M$ and the $M \times M$ submatrix formed by the rows $M(i-1) + 1, \ldots, Mi$ of $H'$ is the
identity matrix (up to some permutations over the rows) and therefore has full rank. Then
obviously a matrix $K'$ with the desired property exists and the procedure described above
leads to a forecaster with an average regret of order $n^{-1/3}$.

This last statement may be generalized in a straightforward way to an even larger class 
of problems as follows. 


Corollary 6.2 (Distinguishing actions). Assume that the feedback matrix $H$ is such that
for each outcome $j = 1, \ldots, M$ there exists an action $i \in \{1, \ldots, N\}$ such that for all
outcomes $j' \ne j$, $h(i, j) \ne h(i, j')$. Then the conversion procedure described above leads
to a Hannan-consistent forecaster with an average regret of order $n^{-1/3}$.


The rank of H’ may be considered as a measure of the information provided by 
the feedback. The highest possible value is achieved by matrices H’ with rank M. For 
such feedback matrices, Hannan-consistency may be achieved for all associated loss 
matrices L’. 

Even though the above conversion strategy applies to a large class of problems, the
associated condition fails to characterize the set of pairs $(L, H)$ for which a Hannan consistent
forecaster exists. Indeed, Piccolboni and Schindelhauer [234] show a second simple
conversion of the pair $(L', H')$ that can be applied in situations when there is no matrix $K'$
with the property $L' = K'H'$ (this second conversion basically deals with some actions
that they define as “useless”). In these situations a Hannan-consistent procedure may be
constructed on the basis of the forecaster of Section 6.5. On the other hand, Piccolboni
and Schindelhauer also show that if the condition of Theorem 6.6 is not satisfied after the
second step of conversion, then there exists an external randomization over the sequences of
outcomes such that the sequence of expected regrets grows at least as $n$, where the
expectations are understood with respect to the forecaster's auxiliary randomization and the
external randomization. Thus, a proof by contradiction using the dominated-convergence
theorem shows that Hannan consistency is impossible to achieve in these cases. This result,
combined with Theorem 6.6, implies the following gap theorem.


Corollary 6.3. Consider a partial monitoring game $(L, H)$. If Hannan consistency can be
achieved, then there exists a Hannan-consistent forecaster whose average regret vanishes
at rate $n^{-1/3}$.


Thus, whenever it is possible to force the average regret to converge to 0, a convergence
rate of order $n^{-1/3}$ is also possible.


We close this section by pointing out a situation in which Hannan consistency is impos- 
sible to achieve. 


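Example 6.10. Consider a case with $N = M = 3$ and

\[
L = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\quad \text{and} \quad
H = \begin{pmatrix} a & b & b \\ a & b & b \\ a & b & b \end{pmatrix}.
\]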

In this example, the second and third outcomes are indistinguishable for the forecaster.
Obviously, Hannan consistency is impossible to achieve in this case. However, it is easy to
construct a strategy for which

\[
\frac{1}{n} \left( \sum_{t=1}^{n} \ell(I_t, Y_t) - \min\left( L_{1,n},\; \frac{L_{2,n} + L_{3,n}}{2} \right) \right) = o(1),
\]

with probability 1 (see Exercise 6.10 for a somewhat more general statement).


6.7 Multi-armed Bandit Problems 


This and the next two sections are dedicated to multi-armed bandit problems, which we
define in Example 6.1 of Section 6.4. Recall that in this problem the forecaster, after making
a prediction, learns his own loss $\ell(I_t, Y_t)$ but not the value of the outcome $Y_t$. Thus, the
forecaster does not have access to the losses he would have suffered had he chosen a
different action. The goal of the forecaster remains the same, which is to guarantee that
his cumulative loss $\sum_{t=1}^{n} \ell(I_t, Y_t)$ is not much larger than the cumulative loss of the best
action, $\min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t)$.

In the classical formulation of multi-armed bandit problems (see, e.g., Robbins [245],
Lai and Robbins [189]) it is assumed that, for each action $i$, the losses $\ell(i, y_1), \ldots, \ell(i, y_n)$
are randomly and independently drawn according to a fixed but unknown distribution. In
such a case, at the beginning of the game one may sample all arms to estimate the means
of the loss distributions (this is called the exploration phase), and while the forecaster has
a high level of confidence in the sharpness of the estimated values, one may keep choosing
the action with the smallest estimated loss (the exploitation phase). Indeed, under mild
conditions on the distribution, one may achieve that the per-round regret

\[
\frac{1}{n} \sum_{t=1}^{n} \ell(I_t, y_t) - \min_{i=1,\ldots,N} \frac{1}{n} \sum_{t=1}^{n} \ell(i, y_t)
\]

converges to 0 with probability 1, and a delicate tradeoff between exploitation and exploration
may be achieved under additional assumptions (see the exercises).

Here we investigate the significantly more challenging problem when the outcomes
$Y_t$ are generated in the nonoblivious opponent model. This variant has been called the
nonstochastic (or adversarial) multi-armed bandit problem. Thus, the problem we consider
is the same as in Section 4.2, but now the forecaster's actions cannot depend on the past
values of $\ell(i, Y_t)$ except when $i = I_t$.

As we have already observed in Example 6.1, the adversarial bandit problem is a special
case of the prediction problem under partial monitoring defined in Section 6.4. In this
special case the feedback matrix $H$ equals the loss matrix $L$. Therefore, Theorems 6.5
and 6.6 apply because one may use the general forecaster for partial monitoring with $K$
taken as the identity matrix. The forecaster that, according to these theorems, achieves a
per-round regret of order $n^{-1/3}$ (and is, therefore, Hannan-consistent) takes, at time $t$, the
action $I_t$ drawn according to the distribution $\mathbf{p}_t$ defined by

\[
p_{i,t} = (1 - \gamma)\, \frac{e^{-\eta \widetilde{L}_{i,t-1}}}{\sum_{k=1}^{N} e^{-\eta \widetilde{L}_{k,t-1}}} + \frac{\gamma}{N},
\]

where $\widetilde{L}_{i,t-1} = \sum_{s=1}^{t-1} \widetilde{\ell}(i, Y_s)$ is the estimated cumulative loss and

\[
\widetilde{\ell}(i, Y_t) =
\begin{cases}
\ell(i, Y_t)/p_{i,t} & \text{if } I_t = i, \\
0 & \text{otherwise}
\end{cases}
\]

is used to estimate the loss $\ell(i, Y_t)$ at time $t$ for all $i = 1, \ldots, N$. The parameters $\gamma, \eta \in (0, 1)$
may be set according to the values specified in Theorems 6.5 and 6.6 (with $k^* = 1$).
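The forecaster just described can be sketched as follows. The code is a minimal illustration under stated assumptions: `loss_fn(t, i)` is a hypothetical callback returning the loss in $[0, 1]$ of action `i` at round `t` (only the value for the played action is ever requested), and subtracting the smallest estimated cumulative loss before exponentiating is a numerical-stability device that leaves the distribution unchanged.

```python
import math
import random

def exp_weights_bandit(loss_fn, n, n_actions, eta, gamma, rng):
    """Exponentially weighted forecaster with importance-weighted loss estimates.

    Returns (cumulative loss, estimated cumulative losses, last distribution).
    """
    est = [0.0] * n_actions            # estimated cumulative losses
    total_loss = 0.0
    p = [1.0 / n_actions] * n_actions
    for t in range(n):
        base = min(est)                # shift for numerical stability
        w = [math.exp(-eta * (x - base)) for x in est]
        total_w = sum(w)
        p = [(1 - gamma) * wi / total_w + gamma / n_actions for wi in w]
        action = rng.choices(range(n_actions), weights=p)[0]
        loss = loss_fn(t, action)      # the only loss the forecaster observes
        total_loss += loss
        est[action] += loss / p[action]  # unbiased estimate; other actions get 0
    return total_loss, est, p

# Hypothetical loss sequence: action 0 never suffers a loss, the others always do.
total, est, p = exp_weights_bandit(
    lambda t, i: 0.0 if i == 0 else 1.0,
    n=200, n_actions=3, eta=0.05, gamma=0.1, rng=random.Random(1))
```

The mixing term `gamma / n_actions` keeps every probability bounded away from zero, which in turn bounds the size of the estimates `loss / p[action]`.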

Interestingly, even though the bounds of order $n^{2/3}$ established in Theorems 6.5 and 6.6
are not improvable in general partial monitoring problems (see Theorem 6.7), in the special
case of the multi-armed bandit problem, significantly better performance may be achieved
by an appropriate modification of the forecaster described above. The details and the
corresponding bound are shown in Section 6.8, and Section 6.9 establishes a matching
lower bound.

We close this section by mentioning that the forecaster strategy defined above, which 
may be viewed as an extension of the exponentially weighted average predictor, may be 
generalized. In particular, just as in Section 4.2, we allow the use of potential functions other 
than the exponential. This way we obtain a large family of Hannan-consistent forecasting 
strategies for the adversarial multi-armed bandit problem. 
Recall that weighted-average randomized strategies were defined such that at time $t$, the
forecaster chooses action $i$ randomly with probability

\[
p_{i,t} = \frac{\nabla_i \Phi(\mathbf{R}_{t-1})}{\sum_{k=1}^{N} \nabla_k \Phi(\mathbf{R}_{t-1})}
        = \frac{\phi'(R_{i,t-1})}{\sum_{k=1}^{N} \phi'(R_{k,t-1})},
\]

where $\Phi$ is an appropriate potential function defined in Section 4.2. In the bandit problem
this strategy is not feasible because the regrets $R_{i,t-1} = \sum_{s=1}^{t-1} \bigl(\ell(\mathbf{p}_s, Y_s) - \ell(i, Y_s)\bigr)$ are not
available to the forecaster. The components $r_{i,t} = \ell(\mathbf{p}_t, Y_t) - \ell(i, Y_t)$ of the regret vector
$\mathbf{r}_t$ are substituted by the components $\widetilde{r}_{i,t} = \ell(I_t, Y_t) - \widetilde{\ell}(i, Y_t)$ of the estimated regret $\widetilde{\mathbf{r}}_t$.
Finally, at time $t$ an action $I_t$ is chosen from the set $\{1, \ldots, N\}$ randomly according to the
distribution $\mathbf{p}_t$ defined by

\[
p_{i,t} = (1 - \gamma_t)\, \frac{\phi'(\widetilde{R}_{i,t-1})}{\sum_{k=1}^{N} \phi'(\widetilde{R}_{k,t-1})} + \frac{\gamma_t}{N},
\]

where $\gamma_t \in (0, 1)$ is a nonincreasing sequence of constants that we typically choose to
converge to 0 at a certain rate.

First of all note that the defined prediction strategy is feasible because at time $t$ it only
depends on the past losses $\ell(I_s, Y_s)$, $s = 1, \ldots, t - 1$. Note that the estimated regret $\widetilde{\mathbf{r}}_t$ is
an “unbiased” estimate of the regret $\mathbf{r}_t$ in the sense that $\mathbb{E}[\widetilde{r}_{i,t} \mid I_1, \ldots, I_{t-1}] = r_{i,t}$.

Observe also that, apart from the necessary modification of the notion of regret, we have 
also introduced the constants y, whose role is to keep the probabilities p;, far away from 
0 for all i, which is necessary to make sure that a sufficient amount of time is spent for 
“exploration.” 

The next result states Hannan consistency of the strategy defined above under general 
conditions on the potential function. 


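Theorem 6.9. Let

\[
\Phi(\mathbf{u}) = \psi\left( \sum_{i=1}^{N} \phi(u_i) \right)
\]

be a potential function. Assume that

(i) $\sum_{t=1}^{n} 1/\gamma_t^2 = o(n^2/\ln n)$;

(ii) for all vectors $\mathbf{v}_t = (v_{1,t}, \ldots, v_{N,t})$ with $|v_{i,t}| \le N/\gamma_t$,

\[
\lim_{n \to \infty} \frac{1}{\psi(\phi(n))} \sum_{t=1}^{n} C(\mathbf{v}_t) = 0,
\]

where $C$ is the function defined in Theorem 2.1;

(iii) for all vectors $\mathbf{u}_t = (u_{1,t}, \ldots, u_{N,t})$ with $u_{i,t} \le t$,

\[
\lim_{n \to \infty} \frac{1}{\psi(\phi(n))} \sum_{t=1}^{n} \gamma_t \sum_{i=1}^{N} \nabla_i \Phi(\mathbf{u}_t) = 0;
\]

(iv) for all vectors $\mathbf{u}_t = (u_{1,t}, \ldots, u_{N,t})$ with $u_{i,t} \le t$,

\[
\lim_{n \to \infty} \frac{\ln n}{\psi(\phi(n))} \sqrt{ \sum_{t=1}^{n} \frac{1}{\gamma_t^2} \left( \sum_{i=1}^{N} \nabla_i \Phi(\mathbf{u}_t) \right)^{2} } = 0.
\]

Then the potential-based prediction strategy defined above satisfies

\[
\lim_{n \to \infty} \frac{1}{n} \left( \sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{n} \ell(i, Y_t) \right) = 0,
\]

with probability 1.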


The proof is an appropriate extension of the proof of Theorem 2.1 and is left as a guided 
exercise (see Exercise 6.19). The theorem merely states Hannan consistency, but the rates 
of convergence that may be deduced from the proof are far from being optimal. To achieve 
the best rates the martingale inequalities used in the proof should be considerably refined, 
as is done in the next section. Two concrete examples follow. 


Exponential Potentials
In the special case of exponential potentials of the form

\[
\Phi_\eta(\mathbf{u}) = \frac{1}{\eta} \ln \left( \sum_{i=1}^{N} e^{\eta u_i} \right)
\]

the forecaster of Theorem 6.9 coincides with that described at the beginning of the section.
Its Hannan consistency may be deduced from both Theorem 6.9 and Theorem 6.6.


Polynomial Potentials
Recall, from Chapter 2, the polynomial potentials of the form

\[
\Phi_p(\mathbf{u}) = \left( \sum_{i=1}^{N} (u_i)_+^p \right)^{2/p} = \|\mathbf{u}_+\|_p^2,
\]

where $p \ge 2$. Here one may take $\phi(x) = x_+^p$ and $\psi(x) = x^{2/p}$. In this case the conditions of
Theorem 6.9 may be checked easily. First note that $\psi(\phi(n)) = n^2$. Recall from Section 2.1
that for any $\mathbf{u} \in \mathbb{R}^N$, $C(\mathbf{u}) \le 2(p - 1) \|\mathbf{u}\|_p^2$. Thus for condition (ii) of Theorem 6.9 to hold,
it suffices that $\sum_{t=1}^{n} (1/\gamma_t^2) = o(n^2)$, which is implied by condition (i). To verify conditions
(iii) and (iv) note that, for any $\mathbf{u} = (u_1, \ldots, u_N) \in \mathbb{R}^N$,

\[
\sum_{i=1}^{N} \nabla_i \Phi(\mathbf{u})
 = \frac{2 \|\mathbf{u}_+\|_{p-1}^{p-1}}{\|\mathbf{u}_+\|_p^{p-2}}
 \le \frac{2 N^{1/p} \|\mathbf{u}_+\|_p^{p-1}}{\|\mathbf{u}_+\|_p^{p-2}}
 = 2 N^{1/p} \|\mathbf{u}_+\|_p,
\]

where $\mathbf{u}_+ = ((u_1)_+, \ldots, (u_N)_+)$ and we applied Hölder's inequality. Thus, conditions
(iii) and (iv) are satisfied whenever $(1/n^2) \sum_{t=1}^{n} t \gamma_t \to 0$ and $(\ln n/n^2) \sqrt{\sum_{t=1}^{n} t^2/\gamma_t^2} \to 0$,
respectively. These two conditions, together with condition (i) requiring
$(\ln n/n^2) \sum_{t=1}^{n} (1/\gamma_t^2) \to 0$, are sufficient to guarantee Hannan consistency of the
potential-based strategy for any polynomial potential with $p \ge 2$. Taking, for example, $\gamma_t = t^a$, it is
easy to check that all three conditions are satisfied for any $a \in (-1/2, 0)$.
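The final claim can be checked by a short calculation. The sketch below assumes only the standard growth rate $\sum_{t=1}^{n} t^c = \Theta(n^{c+1})$ for $c > -1$ and evaluates the three requirements (two coming from conditions (iii) and (iv), one from condition (i)) for $\gamma_t = t^a$ with $a \in (-1/2, 0)$.

```latex
% Worked check for \gamma_t = t^{a}, a \in (-1/2, 0),
% using \sum_{t=1}^{n} t^{c} = \Theta(n^{c+1}) for c > -1.
\begin{align*}
\frac{1}{n^2} \sum_{t=1}^{n} t\,\gamma_t
  &= \frac{1}{n^2} \sum_{t=1}^{n} t^{1+a}
   = \Theta\bigl(n^{a}\bigr) \to 0
  && \text{because } a < 0, \\
\frac{\ln n}{n^2} \sqrt{\sum_{t=1}^{n} \frac{t^2}{\gamma_t^2}}
  &= \frac{\ln n}{n^2} \sqrt{\Theta\bigl(n^{3-2a}\bigr)}
   = \Theta\bigl(n^{-1/2-a} \ln n\bigr) \to 0
  && \text{because } a > -\tfrac{1}{2}, \\
\frac{\ln n}{n^2} \sum_{t=1}^{n} \frac{1}{\gamma_t^2}
  &= \frac{\ln n}{n^2}\, \Theta\bigl(n^{1-2a}\bigr)
   = \Theta\bigl(n^{-1-2a} \ln n\bigr) \to 0
  && \text{because } a > -\tfrac{1}{2}.
\end{align*}
```

The upper limit $a < 0$ thus comes from the exploration cost, while the lower limit $a > -1/2$ keeps the variance of the estimates under control.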
6.8 An Improved Bandit Strategy


In the previous sections we presented several forecasting strategies that guarantee Hannan- 
consistent prediction in the nonstochastic multi-armed bandit problem. The rates of con- 
vergence that may be obtained in a straightforward manner from Theorems 6.5 and 6.9 are, 
however, suboptimal. The purpose of this section is to introduce a slight modification in 
the forecaster of the previous section and achieve optimal rates of convergence. 

We only consider exponential potentials and assume that the horizon (i.e., the total 
length n of the sequence to be predicted) is fixed and known in advance. The extension to 
unbounded or unknown horizon is, by now, a routine exercise. 

Two modifications of the strategy described at the beginning of the previous section are
necessary to achieve better rates of convergence. First of all, the modified strategy estimates
gains instead of losses. For convenience, we introduce the notation $g(i, Y_t) = 1 - \ell(i, Y_t)$
and the estimated gains

\[
\widetilde{g}(i, Y_t) =
\begin{cases}
g(i, Y_t)/p_{i,t} & \text{if } I_t = i, \\
0 & \text{otherwise,}
\end{cases}
\qquad i = 1, \ldots, N.
\]

Note that $\mathbb{E}[\widetilde{g}(i, Y_t) \mid I_1, \ldots, I_{t-1}] = g(i, Y_t)$, and therefore $\widetilde{g}(i, Y_t)$ is an unbiased estimate
of $g(i, Y_t)$ (different from $1 - \widetilde{\ell}(i, Y_t)$ used in the previous section). The reason for this
modification is subtle. With the modified estimate the difference $g(i, Y_t) - \widetilde{g}(i, Y_t)$ is bounded
by 1 from above, which is used in the martingale-type bound of Lemma 6.7 below.
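Both properties of the estimated gains, unbiasedness and the one-sided bound on $g - \widetilde{g}$, can be verified exactly on a small example. The distribution `p` and the gain vector `gain` below are hypothetical values chosen for illustration; the expectation is computed by summing over the possible played actions.

```python
from fractions import Fraction

def estimated_gain(i, played, gain, p):
    """Estimated gain of action i: gain[i] / p[i] if i was played, else 0."""
    return gain[i] / p[i] if played == i else Fraction(0)

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # sampling distribution
gain = [Fraction(1), Fraction(1, 4), Fraction(0)]      # true gains in [0, 1]
actions = range(len(p))

# Exact conditional expectation of the estimate for each action i.
expected = [sum(p[j] * estimated_gain(i, j, gain, p) for j in actions)
            for i in actions]
# Largest value of g - g~ over all actions i and all played actions j.
worst_gap = max(gain[i] - estimated_gain(i, j, gain, p)
                for i in actions for j in actions)
```

The estimate can be far larger than the true gain (when a rarely sampled action is played), but it can never fall short of it by more than 1.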

The forecasting strategy is defined as follows. 


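A STRATEGY FOR THE MULTI-ARMED
BANDIT PROBLEM

Parameters: Number of actions $N$, positive reals $\beta, \eta, \gamma \le 1$.
Initialization: $w_{i,0} = 1$ and $p_{i,1} = 1/N$ for $i = 1, \ldots, N$.
For each round $t = 1, 2, \ldots$

(1) select an action $I_t \in \{1, \ldots, N\}$ according to the probability distribution $\mathbf{p}_t$;

(2) calculate the estimated gains

    $g'(i, Y_t) = \widetilde{g}(i, Y_t) + \dfrac{\beta}{p_{i,t}} =
    \begin{cases}
    (g(i, Y_t) + \beta)/p_{i,t} & \text{if } I_t = i, \\
    \beta/p_{i,t} & \text{otherwise;}
    \end{cases}$

(3) update the weights $w_{i,t} = w_{i,t-1}\, e^{\eta g'(i, Y_t)}$;

(4) calculate the updated probability distribution

    $p_{i,t+1} = (1 - \gamma)\, \dfrac{w_{i,t}}{W_t} + \dfrac{\gamma}{N}, \qquad i = 1, \ldots, N.$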
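The strategy can be sketched in Python as follows. This is an illustrative sketch rather than a definitive implementation: `gain_fn(t, i)` is a hypothetical callback returning the gain in $[0, 1]$ of action `i` at round `t`, and the weights are stored as logarithms (an implementation choice, not part of the strategy) because the raw weights grow exponentially with the number of rounds.

```python
import math
import random

def bandit_strategy(gain_fn, n, n_actions, beta, eta, gamma, rng):
    """Run the strategy for n rounds; return (cumulative gain, final distribution)."""
    log_w = [0.0] * n_actions                 # log w_{i,0} = 0
    p = [1.0 / n_actions] * n_actions         # p_{i,1} = 1/N
    total_gain = 0.0
    for t in range(n):
        action = rng.choices(range(n_actions), weights=p)[0]   # step (1)
        gain = gain_fn(t, action)             # only the played action's gain is seen
        total_gain += gain
        for i in range(n_actions):
            # Step (2): biased estimate g' = (g * 1{I_t = i} + beta) / p_i.
            g_prime = ((gain if i == action else 0.0) + beta) / p[i]
            log_w[i] += eta * g_prime         # step (3) in the log domain
        top = max(log_w)
        w = [math.exp(x - top) for x in log_w]
        total_w = sum(w)
        # Step (4): mix the normalized weights with the uniform distribution.
        p = [(1 - gamma) * wi / total_w + gamma / n_actions for wi in w]
    return total_gain, p

# Hypothetical gain sequence: only action 0 ever yields a gain.
total_gain, p = bandit_strategy(
    lambda t, i: 1.0 if i == 0 else 0.0,
    n=300, n_actions=4, beta=0.05, eta=0.01, gamma=0.2, rng=random.Random(2))
```

The sample parameters satisfy $\eta \le \gamma/(2N)$, which guarantees $\eta\, g'(i, Y_t) \le 1$ in every round.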


Another modification is that instead of an unbiased estimate, a slightly larger quantity is
used by the strategy. To this end, the strategy uses the quantities $g'(i, Y_t) = \widetilde{g}(i, Y_t) + \beta/p_{i,t}$,
where $\beta$ is a positive parameter whose value is to be determined later. Note that we give up
the unbiasedness of the estimate to guarantee that the estimated cumulative gains are, with
large probability, not much smaller than the actual (unknown) cumulative gains. Thus, the
new estimate may be interpreted as an upper confidence bound on the gain.
The next theorem shows that, as a function of the number of rounds $n$, the regret is
of the same order $O(\sqrt{n})$ of magnitude as in the “full information” case, that is, in the
problem of randomized prediction with expert advice discussed in Section 4.2. The price
one has to pay for not being able to measure the loss of the actions (except for the selected
one) is in the dependence on the number $N$ of actions. The bound derived in Theorem
6.10 is proportional to $\sqrt{N \ln N}$, as opposed to the “full information” bound, which only
grows as $\sqrt{\ln N}$ with the number of actions. Thus, the bound is significantly better than
that implied by Theorem 6.6, which only guarantees a bound of order $(nN)^{2/3} (\ln N)^{1/3}$.
In Section 6.9 we show that the $\sqrt{nN \ln N}$ upper bound shown here is optimal up to a
factor of order $\sqrt{\ln N}$. Thus, the per-round regret is about the order $\sqrt{\ln N/(n/N)}$, as opposed
to $\sqrt{(\ln N)/n}$ achieved in the full information case. This bound reflects the fact that the
available information in the bandit problem is the $N$th part of that of the original, full
information case. This phenomenon is similar to the one we observed in the case of label
efficient prediction, where the per-round regret was of the form $\sqrt{\ln N/m}$.


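Theorem 6.10. For any $\delta \in (0, 1)$ and for any $n \ge 8N \ln(N/\delta)$, if the forecaster for the
multi-armed bandit problem is run with parameters

\[
\gamma \le \frac{1}{2}, \qquad 0 < \eta \le \frac{\gamma}{2N}, \qquad \text{and} \qquad \sqrt{\frac{1}{nN} \ln \frac{N}{\delta}} \le \beta \le 1,
\]

then, with probability at least $1 - \delta$, the regret satisfies

\[
\widehat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le n \bigl( \gamma + \eta (1 + \beta) N \bigr) + \frac{\ln N}{\eta} + 2nN\beta.
\]

In particular, choosing

\[
\beta = \sqrt{\frac{1}{nN} \ln \frac{N}{\delta}}, \qquad \gamma = \frac{4N\beta}{3 + \beta}, \qquad \text{and} \qquad \eta = \frac{\gamma}{2N},
\]

one has

\[
\widehat{L}_n - \min_{i=1,\ldots,N} L_{i,n} \le \frac{11}{2} \sqrt{nN \ln \frac{N}{\delta}} + \frac{\ln N}{2}.
\]

The condition on $n$ ensures that $4N\beta/(3 + \beta) = \gamma$ is at most $1/2$ for the stated choice
of $\beta$.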
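A quick numerical sketch shows how the specific parameter choice turns the general bound of Theorem 6.10 into the simplified one. The function names and the sample values $n = 10^4$, $N = 10$, $\delta = 0.05$ are assumptions for illustration; the choice $\beta = \sqrt{\ln(N/\delta)/(nN)}$, $\gamma = 4N\beta/(3 + \beta)$, $\eta = \gamma/(2N)$ is the one under which the first and last terms of the general bound both equal $2nN\beta$.

```python
import math

def theorem_6_10_parameters(n, n_actions, delta):
    """Parameter choice beta, gamma, eta for the improved bandit strategy."""
    beta = math.sqrt(math.log(n_actions / delta) / (n * n_actions))
    gamma = 4 * n_actions * beta / (3 + beta)
    eta = gamma / (2 * n_actions)
    return beta, gamma, eta

def general_bound(n, n_actions, beta, gamma, eta):
    """n (gamma + eta (1 + beta) N) + ln N / eta + 2 n N beta."""
    return (n * (gamma + eta * (1 + beta) * n_actions)
            + math.log(n_actions) / eta + 2 * n * n_actions * beta)

def simplified_bound(n, n_actions, delta):
    """(11/2) sqrt(n N ln(N / delta)) + (ln N) / 2."""
    return (5.5 * math.sqrt(n * n_actions * math.log(n_actions / delta))
            + math.log(n_actions) / 2)

n, n_actions, delta = 10_000, 10, 0.05
beta, gamma, eta = theorem_6_10_parameters(n, n_actions, delta)
general = general_bound(n, n_actions, beta, gamma, eta)
simplified = simplified_bound(n, n_actions, delta)
```

For these values the horizon comfortably exceeds $8N \ln(N/\delta)$, the exploration rate `gamma` stays below $1/2$, and the general bound is smaller than the simplified one.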

The analysis of Theorem 6.10 is similar, in spirit, to the proof of Theorem 6.6. The
essential novelty necessary to obtain the refined bounds is a more careful bounding of the
difference between the estimated and true cumulative losses of each action.
Introduce the notation

\[
G'_{i,n} = \sum_{t=1}^{n} g'(i, Y_t) \qquad \text{and} \qquad G_{i,n} = \sum_{t=1}^{n} g(i, Y_t).
\]

The following lemma is the key to the proof of Theorem 6.10.


Lemma 6.7. Let $\delta \in (0, 1)$. For any $\beta \in \bigl[\sqrt{\ln(N/\delta)/(nN)},\, 1\bigr]$ and $i \in \{1, \ldots, N\}$, we have

\[
\mathbb{P}\bigl[ G_{i,n} > G'_{i,n} + \beta n N \bigr] \le \frac{\delta}{N}.
\]
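Proof.

\[
\begin{aligned}
\mathbb{P}\bigl[ G_{i,n} > G'_{i,n} + \beta n N \bigr]
 &= \mathbb{P}\bigl[ G_{i,n} - G'_{i,n} > \beta n N \bigr] \\
 &\le \mathbb{E}\bigl[ \exp\bigl( \beta (G_{i,n} - G'_{i,n}) \bigr) \bigr] \exp\bigl( -\beta^2 n N \bigr)
 && \text{(by Markov's inequality).}
\end{aligned}
\]

Observe now that since $\beta \ge \sqrt{\ln(N/\delta)/(nN)}$, $\exp(-\beta^2 n N) \le \delta/N$, and therefore it
suffices to prove that $\mathbb{E}\bigl[ \exp\bigl( \beta (G_{i,n} - G'_{i,n}) \bigr) \bigr] \le 1$. Introducing for $t = 1, \ldots, n$ the random
variable

\[
Z_t = \exp\bigl( \beta (G_{i,t} - G'_{i,t}) \bigr),
\]

we clearly have

\[
Z_t = \exp\left( \beta \left( g(i, Y_t) - \widetilde{g}(i, Y_t) - \frac{\beta}{p_{i,t}} \right) \right) Z_{t-1}.
\]

Next, for $t = 2, \ldots, n$, we bound $\mathbb{E}[Z_t \mid I_1, \ldots, I_{t-1}] = \mathbb{E}_t Z_t$ as follows:

\[
\begin{aligned}
\mathbb{E}_t Z_t
 &= Z_{t-1}\, \mathbb{E}_t\left[ \exp\left( \beta \left( g(i, Y_t) - \widetilde{g}(i, Y_t) - \frac{\beta}{p_{i,t}} \right) \right) \right] \\
 &\le Z_{t-1}\, e^{-\beta^2/p_{i,t}}\, \mathbb{E}_t\left[ 1 + \beta \bigl( g(i, Y_t) - \widetilde{g}(i, Y_t) \bigr) + \beta^2 \bigl( g(i, Y_t) - \widetilde{g}(i, Y_t) \bigr)^2 \right] \\
 &\qquad \text{(since $\beta \le 1$, $g(i, Y_t) - \widetilde{g}(i, Y_t) \le 1$ and $e^x \le 1 + x + x^2$ for $x \le 1$)} \\
 &= Z_{t-1}\, e^{-\beta^2/p_{i,t}}\, \mathbb{E}_t\left[ 1 + \beta^2 \bigl( g(i, Y_t) - \widetilde{g}(i, Y_t) \bigr)^2 \right] \\
 &\qquad \text{(since $\mathbb{E}_t\bigl[ g(i, Y_t) - \widetilde{g}(i, Y_t) \bigr] = 0$)} \\
 &\le Z_{t-1}\, e^{-\beta^2/p_{i,t}} \left( 1 + \frac{\beta^2}{p_{i,t}} \right) \\
 &\qquad \text{(since $\mathbb{E}_t\bigl[ (g(i, Y_t) - \widetilde{g}(i, Y_t))^2 \bigr] \le \mathbb{E}_t\bigl[ \widetilde{g}(i, Y_t)^2 \bigr] \le 1/p_{i,t}$)} \\
 &\le Z_{t-1} \qquad \text{(since $1 + x \le e^x$).}
\end{aligned}
\]

Taking expected values of both sides of the inequality, we have $\mathbb{E} Z_t \le \mathbb{E} Z_{t-1}$. Because
$\mathbb{E} Z_1 \le 1$, we obtain $\mathbb{E} Z_n \le 1$, as desired. □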


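Proof of Theorem 6.10. Let $W_t = \sum_{i=1}^{N} w_{i,t}$. Note first that

\[
\begin{aligned}
\ln \frac{W_n}{W_0}
 &= \ln \left( \sum_{i=1}^{N} e^{\eta G'_{i,n}} \right) - \ln N \\
 &\ge \ln \left( \max_{i=1,\ldots,N} e^{\eta G'_{i,n}} \right) - \ln N \\
 &= \eta \max_{i=1,\ldots,N} G'_{i,n} - \ln N.
\end{aligned}
\]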




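On the other hand, for each $t = 1, \ldots, n$, since $\beta \le 1$ and $\eta \le \gamma/(2N)$ imply that
$\eta g'(i, Y_t) \le 1$, we may write

\[
\begin{aligned}
\ln \frac{W_t}{W_{t-1}}
 &= \ln \sum_{i=1}^{N} \frac{w_{i,t-1}}{W_{t-1}}\, e^{\eta g'(i, Y_t)} \\
 &= \ln \sum_{i=1}^{N} \frac{p_{i,t} - \gamma/N}{1 - \gamma}\, e^{\eta g'(i, Y_t)} \\
 &\le \ln \sum_{i=1}^{N} \frac{p_{i,t} - \gamma/N}{1 - \gamma} \bigl( 1 + \eta g'(i, Y_t) + \eta^2 g'(i, Y_t)^2 \bigr) \\
 &\qquad \text{(since $e^x \le 1 + x + x^2$ for $x \le 1$)} \\
 &\le \ln \left( 1 + \frac{\eta}{1 - \gamma} \sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t) + \frac{\eta^2}{1 - \gamma} \sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t)^2 \right) \\
 &\qquad \text{(since $\sum_{i=1}^{N} (p_{i,t} - \gamma/N) = 1 - \gamma$)} \\
 &\le \frac{\eta}{1 - \gamma} \sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t) + \frac{\eta^2}{1 - \gamma} \sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t)^2 \\
 &\qquad \text{(since $\ln(1 + x) \le x$ for all $x > -1$).}
\end{aligned}
\]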


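Observe now that, by the definition of $g'(i, Y_t)$,

\[
\sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t) = g(I_t, Y_t) + N\beta
\]

and that

\[
\begin{aligned}
\sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t)^2
 &= \sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t) \left( \mathbb{I}_{\{I_t = i\}} \frac{g(i, Y_t)}{p_{i,t}} + \frac{\beta}{p_{i,t}} \right) \\
 &= g'(I_t, Y_t)\, g(I_t, Y_t) + \beta \sum_{i=1}^{N} g'(i, Y_t) \\
 &\le (1 + \beta) \sum_{i=1}^{N} g'(i, Y_t).
\end{aligned}
\]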
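Substituting into the upper bound, summing over $t = 1, \ldots, n$, and writing
$\widehat{G}_n = \sum_{t=1}^{n} g(I_t, Y_t)$, we obtain

\[
\ln \frac{W_n}{W_0} \le \frac{\eta}{1 - \gamma}\, \widehat{G}_n + \frac{\eta}{1 - \gamma}\, nN\beta + \frac{\eta^2 (1 + \beta)}{1 - \gamma} \sum_{i=1}^{N} G'_{i,n}.
\]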


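Comparing the upper and lower bounds for $\ln(W_n/W_0)$ and rearranging,

\[
\begin{aligned}
\widehat{G}_n - (1 - \gamma) \max_{i=1,\ldots,N} G'_{i,n}
 &\ge -\frac{(1 - \gamma) \ln N}{\eta} - nN\beta - \eta (1 + \beta) \sum_{i=1}^{N} G'_{i,n} \\
 &\ge -\frac{\ln N}{\eta} - nN\beta - \eta (1 + \beta) N \max_{i=1,\ldots,N} G'_{i,n},
\end{aligned}
\]

that is,

\[
\widehat{G}_n \ge \bigl( 1 - \gamma - \eta (1 + \beta) N \bigr) \max_{i=1,\ldots,N} G'_{i,n} - \frac{\ln N}{\eta} - nN\beta.
\]

By Lemma 6.7 and the union-of-events bound, with probability at least $1 - \delta$,

\[
\max_{i=1,\ldots,N} G'_{i,n} \ge \max_{i=1,\ldots,N} G_{i,n} - \beta n N
\]

whenever $\sqrt{\ln(N/\delta)/(nN)} \le \beta \le 1$. Thus, with probability at least $1 - \delta$, we have

\[
\widehat{G}_n \ge \bigl( 1 - \gamma - \eta (1 + \beta) N \bigr) \max_{i=1,\ldots,N} G_{i,n} - \frac{\ln N}{\eta} - nN\beta \bigl( 2 - \gamma - \eta (1 + \beta) N \bigr).
\]

In terms of the losses $\widehat{L}_n = n - \widehat{G}_n$ and $L_{i,n} = n - G_{i,n}$, and noting that
$1 - \gamma - \eta (1 + \beta) N \ge 0$ by the choice of the parameters, the inequality is rewritten as

\[
\begin{aligned}
\widehat{L}_n
 &\le \bigl( 1 - \gamma - \eta (1 + \beta) N \bigr) \min_{i=1,\ldots,N} L_{i,n} + n \bigl( \gamma + \eta (1 + \beta) N \bigr)
   + \frac{\ln N}{\eta} + nN\beta \bigl( 2 - \gamma - \eta (1 + \beta) N \bigr) \\
 &\le \min_{i=1,\ldots,N} L_{i,n} + n \bigl( \gamma + \eta (1 + \beta) N \bigr) + \frac{\ln N}{\eta} + 2nN\beta,
\end{aligned}
\]

as desired. □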


Remark 6.5 (Gains vs. losses). It is worth noting that there exists a fundamental asym- 
metry in the non-stochastic multi-armed bandit problem considered here. The fact that our 
randomized strategies sample with overwhelming probability the action with the currently 
best estimate makes, in a certain sense, the game with losses easier than the game with 
gains. The following simple argument explains why: with losses, if the loss of the action 
more often sampled starts to increase, its sampling probability drops quickly. With gains, if 
some action sampled with overwhelming probability becomes a bad action (yielding small 
gains), its sampling probability simply ceases to grow. 


Remark 6.6 (A technical note). One may be tempted to try to generalize the argument
of Theorem 6.10 to other problems of prediction under partial monitoring. By doing that,
one quickly sees that the key property of the bandit problem, which allows one to obtain the
regret bound of order $\sqrt{n}$, is that the quadratic term $\sum_{i=1}^{N} p_{i,t}\, g'(i, Y_t)^2$ can be bounded by
a random variable whose expected value is at most proportional to $N$. The corresponding
term in the proof of Theorem 6.5, dealing with the general partial monitoring problem,
could only be bounded by something of the order $N^2 (k^*)^2/\gamma$, making the regret bound of
order $n^{2/3}$ inevitable.


6.9 Lower Bounds for the Bandit Problem 


Next we present a simple lower bound for the performance of any multi-armed bandit
strategy. The main result of the section shows that the upper bound derived in the previous
section cannot be improved by more than logarithmic factors. In particular, it is shown
below that the regret of any prediction strategy (randomized or not) in the nonstochastic
multi-armed bandit problem can be as large as $\Omega(\sqrt{nN})$. The dependence on time $n$
is similar to the “full information” case in which the best obtainable bound is of order
$\sqrt{n \ln N}$. The main message of the next theorem is that the multi-armed bandit problem is
essentially more difficult than the simple randomized prediction problem of Section 4.2 in
that the dependence on the number of actions (or experts) is much heavier. This is due to
the fact that, because of the limited information available in bandit problems, an important
part of the effort has to be devoted to exploration, that is, to the estimation of the cumulative
losses of each action.


Theorem 6.11. Let $n, N > 1$ be such that $n > N/(4\ln(4/3))$, and assume that the cardinality $M = |\mathcal{Y}|$ of the outcome space is at least $2^N$. There exists a loss function such that
for any, possibly randomized, prediction strategy

$$\sup_{y^n \in \mathcal{Y}^n} \left( \mathbb{E}\,\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n} \right) \ge \sqrt{nN}\,\frac{\sqrt{2}-1}{\sqrt{32\ln(4/3)}}.$$
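To get a feel for the size of this bound, the following minimal sketch evaluates the right-hand side of Theorem 6.11 and the full-information rate $\sqrt{n \ln N}$ mentioned above. The function names are illustrative assumptions, not notation from the text, and the full-information rate is shown without its constant.

```python
import math


def bandit_lower_bound(n: int, N: int) -> float:
    """Right-hand side of Theorem 6.11: sqrt(nN) (sqrt(2) - 1) / sqrt(32 ln(4/3))."""
    # The theorem is stated only for N > 1 and n > N / (4 ln(4/3)).
    if N <= 1 or n <= N / (4 * math.log(4 / 3)):
        raise ValueError("need N > 1 and n > N / (4 ln(4/3))")
    return math.sqrt(n * N) * (math.sqrt(2) - 1) / math.sqrt(32 * math.log(4 / 3))


def full_information_rate(n: int, N: int) -> float:
    """Order of the best full-information bound, sqrt(n ln N), constant omitted."""
    return math.sqrt(n * math.log(N))


if __name__ == "__main__":
    for N in (10, 100, 1000):
        print(N, bandit_lower_bound(10**6, N), full_information_rate(10**6, N))
```

For fixed $n$, the first quantity grows like $\sqrt{N}$ while the second grows only like $\sqrt{\ln N}$, which is the heavier dependence on the number of actions discussed before the theorem.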


Proof. First we prove the theorem for deterministic strategies. The general case will 
follow by a simple argument. The main idea of the proof is to show that there exists a 
loss function and random choices of the outcomes $y_t$ such that for any prediction strategy,
the expected regret is large (here expectation is understood with respect to the random
choice of the $y_t$). More precisely, we describe a probability distribution for the random
choice of $\ell(i, y_t)$. It is easy to see that if $M \ge 2^N$, then there exists a loss function $\ell$
and a distribution for $y_t$ such that the distribution of $\ell(i, y_t)$ is as described next. Let
$\ell(i, y_t) = X_{i,t}$, $i = 1,\dots,N$, $t = 1,\dots,n$, be random variables whose joint distribution is
defined as follows: let $Z$ be uniformly distributed on $\{1,\dots,N\}$. For each $i$, given $Z = i$,
$X_{j,1},\dots,X_{j,n}$ are conditionally independent Bernoulli random variables with parameter
$1/2$ if $j \ne i$ and with parameter $1/2 - \varepsilon$ if $j = i$, where the value of the positive parameter
$\varepsilon < 1/4$ is specified below. Then obviously, for any (non-randomized) prediction strategy,

$$\sup_{y^n \in \mathcal{Y}^n} \left( \widehat{L}_n - \min_{i=1,\dots,N} L_{i,n} \right) \ge \mathbb{E}\left[ \widehat{L}_n - \min_{i=1,\dots,N} L_{i,n} \right],$$

where the expectation on the right-hand side is now with respect to the random variables
$X_{i,t}$. Thus, it suffices to bound, from below, the expected regret for the randomly chosen
losses. First observe that


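$$\mathbb{E}\left[\min_{i=1,\dots,N} L_{i,n}\right] = \sum_{j=1}^N \mathbb{P}[Z=j]\,\mathbb{E}\left[\min_{i=1,\dots,N} L_{i,n} \,\Big|\, Z=j\right] \le \frac{1}{N}\sum_{j=1}^N \min_{i=1,\dots,N} \mathbb{E}\left[L_{i,n} \mid Z=j\right] = \frac{n}{2} - n\varepsilon.$$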
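The random loss construction and the bound $\mathbb{E}[\min_i L_{i,n}] \le n/2 - n\varepsilon$ can be illustrated by simulation. The sketch below is only a Monte Carlo illustration under the stated distribution; the helper names `sample_losses` and `mean_best_cumulative_loss` are hypothetical.

```python
import random


def sample_losses(n, N, eps, rng):
    """Draw Z uniformly on {0,...,N-1} and an N x n table of Bernoulli losses.

    Row Z has parameter 1/2 - eps; every other row has parameter 1/2.
    """
    z = rng.randrange(N)
    losses = [
        [1 if rng.random() < (0.5 - eps if j == z else 0.5) else 0 for _ in range(n)]
        for j in range(N)
    ]
    return z, losses


def mean_best_cumulative_loss(n, N, eps, runs, seed=0):
    """Monte Carlo estimate of E[min_i L_{i,n}] under the construction above."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        _, losses = sample_losses(n, N, eps, rng)
        total += min(sum(row) for row in losses)
    return total / runs


if __name__ == "__main__":
    n, N, eps = 200, 5, 0.2
    print(mean_best_cumulative_loss(n, N, eps, runs=300), "vs bound", n / 2 - n * eps)
```

The estimate should sit at or slightly below $n/2 - n\varepsilon$, up to sampling noise, because the minimum over all actions is never larger than the cumulative loss of the distinguished action $Z$.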


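The nontrivial part is to bound, from below, the expected loss $\mathbb{E}\,\widehat{L}_n$ of an arbitrary prediction
strategy. To this end, fix a (deterministic) prediction strategy, and let $I_t$ denote the action
it chooses at time $t$. Clearly, $I_t$ is determined by the losses $X_{I_1,1},\dots,X_{I_{t-1},t-1}$. Also, let
$T_j = \sum_{t=1}^n \mathbb{I}_{\{I_t = j\}}$ be the number of times action $j$ is played by the strategy. Then, writing
$\mathbb{E}_i$ for $\mathbb{E}[\,\cdot \mid Z = i]$, we may write

$$\begin{aligned}
\mathbb{E}\,\widehat{L}_n &= \frac{1}{N}\sum_{i=1}^N \mathbb{E}_i \widehat{L}_n \quad \text{(by symmetry)} \\
&= \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^n \sum_{j=1}^N \mathbb{E}_i\left[X_{j,t}\,\mathbb{I}_{\{I_t=j\}}\right] \\
&= \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^n \sum_{j=1}^N \mathbb{E}_i\,\mathbb{E}_i\left[X_{j,t}\,\mathbb{I}_{\{I_t=j\}} \mid X_{I_1,1},\dots,X_{I_{t-1},t-1}\right] \\
&= \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^n \sum_{j=1}^N \mathbb{E}_i X_{j,t}\;\mathbb{E}_i \mathbb{I}_{\{I_t=j\}} \quad \text{(since $I_t$ is determined by $X_{I_1,1},\dots,X_{I_{t-1},t-1}$)} \\
&= \frac{n}{2} - \frac{\varepsilon}{N}\sum_{i=1}^N \mathbb{E}_i T_i.
\end{aligned}$$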


Hence, 


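$$\mathbb{E}\left[\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n}\right] \ge \varepsilon\left(n - \frac{1}{N}\sum_{i=1}^N \mathbb{E}_i T_i\right),$$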


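and the proof reduces to bounding $\mathbb{E}_i T_i$ from above, that is, to showing that if $\varepsilon$ is sufficiently small, the best action cannot be chosen too many times by any prediction strategy.
We do this by comparing $\mathbb{E}_i T_i$ with the expected number of times action $i$ is played
by the same prediction strategy when the distribution of the losses of all actions is
Bernoulli with parameter 1/2. To this end, introduce the i.i.d. random variables $X'_{j,t}$ such
that $\mathbb{P}[X'_{j,t} = 0] = \mathbb{P}[X'_{j,t} = 1] = 1/2$ and let $T'_i$ be the number of times action $i$ is played
by the prediction strategy when $\ell(j, y_t) = X'_{j,t}$ for all $j$ and $t$. Similarly, $I'_t$ denotes the index
of the action played at time $t$ under the losses $X'_{j,t}$. Writing $\mathbb{P}_i$ for the conditional distribution $\mathbb{P}[\,\cdot \mid Z = i]$, introduce the probability distributions over the set of binary sequences
$b^n = (b_1,\dots,b_n) \in \{0,1\}^n$,

$$q(b^n) = \mathbb{P}_i\left[X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n\right]$$

and

$$q'(b^n) = \mathbb{P}_i\left[X'_{I'_1,1} = b_1,\dots,X'_{I'_n,n} = b_n\right].$$

Note that $q'(b^n) = 2^{-n}$ for all $b^n$. Observe that for any $b^n \in \{0,1\}^n$,

$$\mathbb{E}_i\left[T_i \mid X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n\right] = \mathbb{E}\left[T'_i \mid X'_{I'_1,1} = b_1,\dots,X'_{I'_n,n} = b_n\right],$$

and therefore we may write

$$\begin{aligned}
\mathbb{E}_i T_i - \mathbb{E}\,T'_i &= \sum_{b^n \in \{0,1\}^n} q(b^n)\,\mathbb{E}_i\left[T_i \mid X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n\right] \\
&\quad - \sum_{b^n \in \{0,1\}^n} q'(b^n)\,\mathbb{E}\left[T'_i \mid X'_{I'_1,1} = b_1,\dots,X'_{I'_n,n} = b_n\right] \\
&= \sum_{b^n \in \{0,1\}^n} \left(q(b^n) - q'(b^n)\right)\mathbb{E}_i\left[T_i \mid X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n\right] \\
&\le \sum_{b^n :\, q(b^n) > q'(b^n)} \left(q(b^n) - q'(b^n)\right)\mathbb{E}_i\left[T_i \mid X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n\right] \\
&\le n \sum_{b^n :\, q(b^n) > q'(b^n)} \left(q(b^n) - q'(b^n)\right)
\end{aligned}$$

(since $\mathbb{E}_i[T_i \mid X_{I_1,1} = b_1,\dots,X_{I_n,n} = b_n] \le n$).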


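The expression on the right-hand side is $n$ times the so-called total variation distance of the
probability distributions $q$ and $q'$. The total variation distance may be bounded conveniently
by Pinsker’s inequality (see Section A.2), which states that

$$\sum_{b^n :\, q(b^n) > q'(b^n)} \left(q(b^n) - q'(b^n)\right) \le \sqrt{\frac{1}{2} D(q' \,\|\, q)},$$

where

$$D(q' \,\|\, q) = \sum_{b^n \in \{0,1\}^n} q'(b^n) \ln\frac{q'(b^n)}{q(b^n)}$$

is the Kullback-Leibler divergence of the distributions $q'$ and $q$. Denoting

$$q_t(b_t \mid b^{t-1}) = \mathbb{P}_i\left[X_{I_t,t} = b_t \mid X_{I_1,1} = b_1,\dots,X_{I_{t-1},t-1} = b_{t-1}\right]$$

and

$$q'_t(b_t \mid b^{t-1}) = \mathbb{P}_i\left[X'_{I'_t,t} = b_t \mid X'_{I'_1,1} = b_1,\dots,X'_{I'_{t-1},t-1} = b_{t-1}\right]$$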


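by the chain rule for relative entropy (see Section A.2) we have

$$\begin{aligned}
D(q' \,\|\, q) &= \sum_{t=1}^n \frac{1}{2^{t-1}} \sum_{b^{t-1} \in \{0,1\}^{t-1}} D\left(q'_t(\cdot \mid b^{t-1}) \,\|\, q_t(\cdot \mid b^{t-1})\right) \\
&= \sum_{t=1}^n \frac{1}{2^{t-1}} \Bigg( \sum_{b^{t-1} :\, I_t = i} D\left(q'_t(\cdot \mid b^{t-1}) \,\|\, q_t(\cdot \mid b^{t-1})\right) + \sum_{b^{t-1} :\, I_t \ne i} D\left(q'_t(\cdot \mid b^{t-1}) \,\|\, q_t(\cdot \mid b^{t-1})\right) \Bigg)
\end{aligned}$$

(since $b^{t-1}$ determines the value $I_t$).
Clearly, if $b^{t-1}$ is such that $I_t \ne i$, then $q'_t(\cdot \mid b^{t-1})$ and $q_t(\cdot \mid b^{t-1})$ both are symmetric
Bernoulli distributions and $D(q'_t(\cdot \mid b^{t-1}) \,\|\, q_t(\cdot \mid b^{t-1})) = 0$. On the other hand, if $b^{t-1}$ is
such that $I_t = i$, then $q'_t(\cdot \mid b^{t-1})$ is a symmetric Bernoulli distribution and $q_t(\cdot \mid b^{t-1})$ is
Bernoulli with parameter $1/2 - \varepsilon$. In this case

$$D\left(q'_t(\cdot \mid b^{t-1}) \,\|\, q_t(\cdot \mid b^{t-1})\right) = -\frac{1}{2}\ln(1 - 4\varepsilon^2) \le 8\ln(4/3)\,\varepsilon^2,$$

where we used the elementary inequality $-\ln(1 - x) \le 4\ln(4/3)\,x$ for $x \in [0, 1/4]$. We
have thus obtained

$$D(q' \,\|\, q) \le 8\ln(4/3)\,\varepsilon^2 \sum_{t=1}^n \frac{1}{2^{t-1}} \sum_{b^{t-1}} \mathbb{I}_{\{b^{t-1} :\, I_t = i\}}.$$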
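The per-round divergence used above is easy to check numerically. The sketch below computes the Kullback-Leibler divergence between a symmetric Bernoulli distribution and a Bernoulli distribution with parameter $1/2 - \varepsilon$, so that it can be compared with the closed form $-\frac{1}{2}\ln(1-4\varepsilon^2)$ and the bound $8\ln(4/3)\,\varepsilon^2$. The function names are illustrative assumptions.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_fair_vs_biased(eps: float) -> float:
    """Divergence of the symmetric Bernoulli from Bernoulli(1/2 - eps)."""
    return kl_bernoulli(0.5, 0.5 - eps)


if __name__ == "__main__":
    for eps in (0.05, 0.1, 0.2, 0.25):
        closed_form = -0.5 * math.log(1 - 4 * eps**2)
        upper_bound = 8 * math.log(4 / 3) * eps**2
        print(eps, kl_fair_vs_biased(eps), closed_form, upper_bound)
```

The bound is tight at $\varepsilon = 1/4$, where both sides equal $\frac{1}{2}\ln(4/3)$.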
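Summarizing, we have

$$\begin{aligned}
\frac{1}{N}\sum_{i=1}^N \left(\mathbb{E}_i T_i - \mathbb{E}\,T'_i\right) &\le \frac{1}{N}\sum_{i=1}^N n\sqrt{\frac{1}{2} D(q' \,\|\, q)} \\
&\le \frac{n}{N}\sum_{i=1}^N \sqrt{4\ln(4/3)\,\varepsilon^2 \sum_{t=1}^n \frac{1}{2^{t-1}} \sum_{b^{t-1}} \mathbb{I}_{\{b^{t-1} :\, I_t = i\}}} \\
&\le n\sqrt{\frac{1}{N}\sum_{i=1}^N 4\ln(4/3)\,\varepsilon^2 \sum_{t=1}^n \frac{1}{2^{t-1}} \sum_{b^{t-1}} \mathbb{I}_{\{b^{t-1} :\, I_t = i\}}} \quad \text{(by Jensen's inequality)} \\
&= n\varepsilon\sqrt{\frac{4\ln(4/3)}{N} \sum_{t=1}^n \frac{1}{2^{t-1}} \sum_{b^{t-1}} \sum_{i=1}^N \mathbb{I}_{\{b^{t-1} :\, I_t = i\}}} \\
&= n\varepsilon\sqrt{\frac{4\ln(4/3)\,n}{N}},
\end{aligned}$$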


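and therefore, since $\sum_{i=1}^N T'_i = n$,

$$\mathbb{E}\left[\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n}\right] \ge \varepsilon\left(n - \frac{n}{N} - n\varepsilon\sqrt{\frac{4\ln(4/3)\,n}{N}}\right).$$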


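Bounding $1/N \le 1/2$ and choosing $\varepsilon = \sqrt{cN/n}$, with $c = 1/(8\ln(4/3))$ (which is guaranteed to be less than 1/4 by the condition on $n$), we obtain

$$\mathbb{E}\left[\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n}\right] \ge \sqrt{nN}\,\frac{\sqrt{2}-1}{\sqrt{32\ln(4/3)}}$$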


as desired. This finishes the proof for deterministic strategies. To see that the result remains
true for any randomized strategy, just note that any randomized strategy may be regarded as
a randomized choice from a class of deterministic strategies. Since the inequality is true for
any deterministic strategy, it must hold even if we average over all deterministic strategies
according to the randomization. Then, by Fubini’s theorem, we find that the expected value
(with respect to the random choice of the losses) of $\mathbb{E}\,\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n}$ is bounded
from below by $\sqrt{nN}\,(\sqrt{2}-1)/\sqrt{32\ln(4/3)}$ (where in $\mathbb{E}\,\widehat{L}_n$ the expectation is meant only
with respect to the randomization of the strategy). Thus, because the maximum is at least
as large as the expected value, we conclude that

$$\sup_{y^n \in \mathcal{Y}^n} \left( \mathbb{E}\,\widehat{L}_n - \min_{i=1,\dots,N} L_{i,n} \right) \ge \sqrt{nN}\,\frac{\sqrt{2}-1}{\sqrt{32\ln(4/3)}}.$$


6.10 How to Select the Best Action


So far we have described several regret-minimizing strategies that draw, at each step, an 
action according to a certain probability distribution over the possible actions. In this section 
we investigate a different scenario in which the randomized strategy performs a single draw 
from the set of possible actions and then is forced to use the drawn action at each subsequent 
step. However, before performing the draw, the strategy is allowed to observe the outcome 
of all actions for an arbitrary amount of time. The goal of the strategy is to pick as early as 
possible an action yielding a large total gain. (In this setup it is more convenient to work 
with gains rather than losses.) 

As a motivating application consider an entrepreneur who wants to start manufacturing 
a certain product. The production line can be configured to manufacture one of N different 
products. As reconfiguring the line has a certain fixed cost, the entrepreneur wants to choose 
a product whose sales will yield a total profit of at least d. The decision is based on a market 
analysis reporting the temporal evolution of potential sales for each of the N products. At 
the end of this “surveillance phase,” a single product is chosen to be manufactured. We cast 
this example in our game-theoretic setup by assigning gain $x_{i,t}$ to action $i$ at time $t$ if $x_{i,t}$ is
the profit of the entrepreneur originating from the sales of product $i$ in the $t$-th time interval.
If the $t$-th time interval falls in the surveillance phase, then $x_{i,t}$ is the profit the entrepreneur
would have obtained had he chosen to manufacture product $i$ at any time earlier than $t$ (we
make the simplifying assumption that each such potential profit is measured exactly in the 
surveillance phase). 

In this analysis we only consider actions yielding binary gains; that is, $x_{i,t} \in \{0, 1\}$
for each $t = 1, 2, \dots$ and $i = 1,\dots,N$. We assume gains are generated in the oblivious
opponent model, specifying the binary gain of each action $i = 1,\dots,N$ at each time step
$t = 1, 2, \dots$ so that every action has a finite total gain. That is, for every $i$ there exists $n_i$
such that $x_{i,t} = 0$ for all $t > n_i$. Let $G_{i,t} = x_{i,1} + \cdots + x_{i,t}$, and let $G_i = G_{i,n_i}$ be the total
gain of action $i$. We write $G^*$ to denote $\max_i G_i$. Because we are restricted to the oblivious
opponent model, we may think of a gain assignment $x_{i,t} \in \{0, 1\}$ for $t = 1,\dots,n_i$ and
$i = 1,\dots,N$ being determined at the beginning of the game and identify the choice of the
opponent with the choice of a specific gain assignment.

Given an arbitrary and unknown gain assignment, at each time $t = 1, 2, \dots$ a selection
strategy observes the tuple $(x_{1,t},\dots,x_{N,t})$ of action gains and decides whether to choose
an action or not. If no action is chosen, then the next tuple is shown; otherwise, the game
ends. Suppose the game ends at time step $t$ when the strategy chooses action $k$. Then the
total gain of the strategy is the remaining gain $G_k - G_{k,t}$ of action $k$ (all gains up to time $t$
are lost).
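As a small illustration of this bookkeeping, the sketch below computes the remaining gain $G_k - G_{k,t}$ and the best total gain $G^*$ from a table of binary gains. The representation (a list of 0/1 lists, one per action) and the function names are assumptions made for the example.

```python
def remaining_gain(gains, k, t):
    """Gain G_k - G_{k,t} collected when action k is chosen at time step t.

    gains[k][s - 1] holds x_{k,s}; k is a 0-based action index and t is a
    1-based time step. All gains up to and including time t are lost.
    """
    return sum(gains[k][t:])


def best_total_gain(gains):
    """G* = max_i G_i, the largest total gain of any single action."""
    return max(sum(row) for row in gains)


if __name__ == "__main__":
    gains = [[1, 1, 0, 1],
             [1, 1, 1, 1]]
    print(best_total_gain(gains))       # total gain of the best action
    print(remaining_gain(gains, 1, 2))  # action 1 chosen at time step 2
```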

The question we investigate here is how large the best total gain $G^*$ has to be such that
some randomized selection algorithm guarantees (with large probability) a gain of at least
$d$, where $d$ is a fixed quantity. Obviously, $G^*$ has to be at least $d$, and the larger $G^*$ the
“easier” it is to select an action with a large gain. On the other hand, it is clear intuitively
that $G^*$ has to be substantially larger than $d$. Just observe that any selection strategy has a
gain of $d$ with probability at most $1/N$ on some gain assignment satisfying $G^* = d$. (Let
$x_{i,t} = 1$ for all $i$ and $t \le d - 1$, and let $x_{i,d} = 0$ for all but one action $i$.) The main result
of this section shows that there exists a strategy gaining $d$ with a large probability on every
gain assignment satisfying $G^* = \Omega(d \ln N)$. More precisely, we show that the following
randomized selection strategy, for any given $d \ge 1$, gains at least $d$ with probability at least
$1 - \delta$ on any gain assignment such that $G^* = \Omega\left(\frac{d}{\delta}\ln N\right)$.


A RANDOMIZED STRATEGY TO SELECT
THE BEST ACTION


Parameters: real numbers $a, b > 0$.
Initialization: $G_{i,0} = 0$ for $i = 1,\dots,N$.
For each round $t = 1, 2, \dots$
(1) for each $j = 1,\dots,N$ observe $x_{j,t}$ and compute $G_{j,t} = G_{j,t-1} + x_{j,t}$;
(2) determine the subset $N_t \subseteq \{1,\dots,N\}$ of actions $j$ such that $x_{j,t} = 1$;
(3) for each $j \in N_t$ draw a Bernoulli random variable $Z_j = Z_{j,G_{j,t}}$ such that
$\mathbb{P}[Z_j = 1] = \min\{1, e^{aG_{j,t}-b}\}$;
(4) if $Z_1 = 0,\dots,Z_N = 0$, then continue; otherwise output the smallest index
$i \in N_t$ such that $Z_i = 1$ and exit.
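The strategy in the box can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the gain table is assumed to be a list of 0/1 lists, the function name `select_action` is hypothetical, and the Bernoulli variables are drawn one at a time in increasing index order, stopping at the first success, which yields the same output distribution as drawing all of them and taking the smallest marked index.

```python
import math
import random


def select_action(gains, a, b, rng=None):
    """Run the randomized selection strategy on a table of binary gains.

    gains[j][t - 1] holds x_{j,t}. Returns (action, time) with a 0-based
    action index and a 1-based time step, or None if nothing is selected.
    """
    rng = rng or random.Random()
    num_actions = len(gains)
    horizon = max(len(row) for row in gains)
    cumulative = [0] * num_actions  # G_{j,t}
    for t in range(1, horizon + 1):
        up = []  # actions with x_{j,t} = 1, in increasing index order
        for j in range(num_actions):
            x = gains[j][t - 1] if t <= len(gains[j]) else 0
            cumulative[j] += x
            if x == 1:
                up.append(j)
        for j in up:
            exponent = a * cumulative[j] - b
            # p = min{1, exp(a G_{j,t} - b)}; avoid overflow for large exponents
            p = 1.0 if exponent >= 0 else math.exp(exponent)
            if rng.random() < p:
                return j, t
    return None


if __name__ == "__main__":
    table = [[1] * 50, [1] * 200, [0, 1] * 100]
    print(select_action(table, a=0.05, b=5.0, rng=random.Random(1)))
```

Because the marking probability grows exponentially with the cumulative gain, an action is unlikely to be picked early and becomes almost certain to be picked once its cumulative gain approaches $b/a$.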


This bound cannot be significantly improved. In fact, we show that no selection strategy
can gain more than $d$ with probability at least $1 - \delta$ on all gain assignments satisfying
$G^* = O\left(\frac{d}{\delta}\ln N\right)$. Our randomized selection strategy uses, for each action $i$, independent
Bernoulli random variables $Z_{i,1}, Z_{i,2}, \dots$ such that $Z_{i,k}$ has parameter

$$p(k) = \min\{1, e^{ak-b}\}$$

for $a, b > 0$ (to be specified later).


Theorem 6.12. For all $d \ge 1$ and $0 < \delta < 1$, if the randomized strategy defined above is
run with parameters $a = \delta/(6d)$ and $b = \ln(6dN/\delta)$, then, with probability at least $1 - \delta$,
some action $i$ with a remaining gain of at least $d$ will be selected whenever the gain
assignment is such that

$$G^* > d\left(1 + \frac{6}{\delta}\ln\left(\frac{6N}{\delta}\ln\frac{3}{\delta}\right)\right).$$


Proof. Fix a gain assignment such that $G^*$ satisfies the condition of the theorem. We work
in the sample space generated by the independent random variables $Z_{i,s}$ for $i = 1,\dots,N$
and $s \ge 1$. For each $i$ and for each $s > n_i$ we set $Z_{i,s} = 0$ with probability 1 to simplify
notation. Thus, $Z_{i,s}$ can only be equal to 1 if $s$ equals $G_{i,t}$ for some $t$ with $x_{i,t} = 1$. Hence
the sample space may be taken to be $\Omega = \{0, 1\}^M$, where $M \le NG^*$ is the total number of
those $x_{i,t}$ that are equal to 1.

Say that an action $i$ is up at time $t$ if $x_{i,t} = 1$. Say that an action $i$ is marked at time $t$ if
$i$ is up at time $t$ and $Z_{i,G_{i,t}} = 1$. Hence, the strategy selects action $i$ at time $t$ if $i$ is marked
at time $t$, no other action $j < i$ is marked at time $t$, and no action has been marked before
time $t$.

Let $A \subset \Omega$ be the event such that $Z_{j,s} = 0$ for $j = 1,\dots,N$ and $s = 1,\dots,d$. Let
$W_{i,t} \subset \Omega$ be the event such that (i) action $i$ is the only action marked at time $t$ and (ii)
$Z_{j,s} = 0$ for all $j = 1,\dots,N$ and all $d < s < G_{j,t}$. Note that $W_{i,t}$ and $W_{j,t}$ are disjoint for
$i \ne j$ and $W_{i,t}$ is empty unless $G_{i,t} > d$. Let $W = \bigcup_{i,t} W_{i,t}$. Define the one-to-one mapping
$\mu$ with domain $A \cap W$ that, for each $i$ and $t$, maps each elementary event $\omega \in A \cap W_{i,t}$ to
the elementary event $\mu(\omega)$ such that (recall that $p(k) = \min\{1, e^{ak-b}\}$)

• if $p(G_{i,t} - d) \ge 1/2$, then $\mu(\omega)$ is equal to $\omega$ except for the component of $\omega$ corresponding to $Z_{i,G_{i,t}-d}$, which is set to 1;

• if $p(G_{i,t} - d) < 1/2$, then $\mu(\omega)$ is equal to $\omega$ except for the component of $Z_{i,G_{i,t}}$, which
is set to 0, and the component of $Z_{i,G_{i,t}-d}$, which is set to 1.

(Note that for all $\omega \in A \cap W_{i,t}$, $Z_{i,G_{i,t}} = 1$ and $Z_{i,G_{i,t}-d} = 0$ by definition.)

The realization of the event $\mu(A \cap W)$ implies that the strategy selects an action $i$ at
time $s$ such that $G_i - G_{i,s} \ge d$ (we call this a winning event). In fact, $A \cap W_{i,t}$ implies
that no action other than $i$ is selected before time $t$. Hence, $Z_{i,G_{i,t}-d} = 1$ ensures that $i$ is
selected at time $s < t$ and $G_i - G_{i,s} \ge d$.

We now move on to lower bounding the probability of $\mu(A \cap W_{i,t})$ in terms of
$\mathbb{P}[A \cap W_{i,t}]$. We distinguish three cases.


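Case 1. If $p(G_{i,t} - d) = 1$, then $\mathbb{P}[\mu(A \cap W_{i,t})] \ge \mathbb{P}[A \cap W_{i,t}] = 0$.

Case 2. If $1/2 \le p(G_{i,t} - d) < 1$, then

$$\mathbb{P}[\mu(A \cap W_{i,t})] = \frac{p(G_{i,t} - d)}{1 - p(G_{i,t} - d)}\,\mathbb{P}[A \cap W_{i,t}],$$

and using $1/2 \le p(G_{i,t} - d) < 1$ together with $e^{-ad} > 0$,

$$\frac{p(G_{i,t} - d)}{1 - p(G_{i,t} - d)} = \frac{e^{-ad}\,e^{aG_{i,t}-b}}{1 - e^{-ad}\,e^{aG_{i,t}-b}} \ge \frac{e^{-ad}/2}{1 - e^{-ad}/2}.$$

Case 3. If $p(G_{i,t} - d) < 1/2$, then

$$\mathbb{P}[\mu(A \cap W_{i,t})] = \frac{p(G_{i,t} - d)}{1 - p(G_{i,t} - d)} \cdot \frac{1 - p(G_{i,t})}{p(G_{i,t})}\,\mathbb{P}[A \cap W_{i,t}].$$

Using $p(G_{i,t} - d) < 1/2$ and $e^{-ad} < 1$, we get

$$\frac{p(G_{i,t} - d)}{1 - p(G_{i,t} - d)} \cdot \frac{1 - p(G_{i,t})}{p(G_{i,t})} = e^{-ad}\,\frac{1 - p(G_{i,t})}{1 - e^{-ad}\,p(G_{i,t})} \ge \frac{e^{-ad}/2}{1 - e^{-ad}/2}.$$

Now, because

$$\frac{e^{-ad}/2}{1 - e^{-ad}/2} = \frac{e^{-ad}}{2 - e^{-ad}} \ge \frac{1 - ad}{1 + ad} \ge 1 - 2ad,$$

exploiting the independence of $A$ and $W$ we get

$$\mathbb{P}[\mu(A \cap W)] \ge (1 - 2ad)\,\mathbb{P}[A]\,\mathbb{P}[W].$$

We now proceed by lower bounding the probabilities of the events $A$ and $W$. For $A$ we get

$$\mathbb{P}[A] \ge \prod_{j=1}^N \left(1 - p(\min\{d, n_j\})\right)^{\min\{d, n_j\}} \ge \left(1 - e^{ad-b}\right)^{dN} \ge 1 - dN e^{ad-b}.$$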
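The two chains of elementary inequalities used above, the one behind the factor $1 - 2ad$ and the one bounding $\mathbb{P}[A]$, can be checked numerically. The sketch below is only a sanity check; the function names are illustrative, and `prob_A_chain` assumes $e^{ad-b} \le 1$.

```python
import math


def ratio_chain(x: float):
    """Terms of e^{-x}/(2 - e^{-x}) >= (1 - x)/(1 + x) >= 1 - 2x, with x = a*d >= 0."""
    e = math.exp(-x)
    return (e / 2) / (1 - e / 2), (1 - x) / (1 + x), 1 - 2 * x


def prob_A_chain(d: int, N: int, a: float, b: float):
    """Terms of (1 - e^{ad-b})^{dN} >= 1 - dN e^{ad-b}, assuming e^{ad-b} <= 1."""
    q = math.exp(a * d - b)
    return (1 - q) ** (d * N), 1 - d * N * q


if __name__ == "__main__":
    for x in (0.0, 0.05, 0.2, 0.5):
        print(x, ratio_chain(x))
    d, N, delta = 5, 10, 0.1
    print(prob_A_chain(d, N, a=delta / (6 * d), b=math.log(6 * d * N / delta)))
```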


To lower bound $\mathbb{P}[W]$, first observe that $\mathbb{P}[W] \ge 1 - (1 - p(G^* - d))^d$. Indeed, the complement of $W$ implies that no action is ever marked after it has been up for $d$ times, and this
in turn implies that even the action that gains $G^*$ is not marked the last $d$ times it has been
up (here, without loss of generality, we implicitly assume that $G^* \ge 2d$). Furthermore,


$$p(G^* - d) = \min\{1, e^{a(G^*-d)-b}\}.$$
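Piecing all together, we obtain

$$\begin{aligned}
\mathbb{P}[\mu(A \cap W)] &\ge (1 - 2ad)\left(1 - dN e^{ad-b}\right)\left(1 - \left(1 - e^{a(G^*-d)-b}\right)^d\right) \\
&\ge 1 - 2ad - dN e^{ad-b} - \left(1 - e^{a(G^*-d)-b}\right)^d.
\end{aligned}$$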


To complete the proof, note that the setting $a = \delta/(6d)$ implies $2ad = \delta/3$. Moreover,
the setting $b = \ln(6dN/\delta)$ implies $dN e^{ad-b} \le \delta/3$. Finally, using $1 - x \le e^{-x}$ for all $x$,
straightforward algebra yields $\left(1 - e^{a(G^*-d)-b}\right)^d \le \delta/3$ whenever

$$G^* > d\left(1 + \frac{6}{\delta}\ln\left(\frac{6N}{\delta}\ln\frac{3}{\delta}\right)\right).$$

This concludes the proof.
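The final step of the proof can be checked numerically for concrete values. The sketch below computes the parameters $a$ and $b$ and the threshold on $G^*$ from Theorem 6.12, then evaluates the three terms that are each claimed to be at most $\delta/3$. The function names and the sample values of $d$, $N$, and $\delta$ are assumptions chosen for illustration.

```python
import math


def theorem_6_12_setup(d: int, N: int, delta: float):
    """Parameters a, b and the lower threshold on G* from Theorem 6.12."""
    a = delta / (6 * d)
    b = math.log(6 * d * N / delta)
    threshold = d * (1 + (6 / delta) * math.log((6 * N / delta) * math.log(3 / delta)))
    return a, b, threshold


def failure_terms(d: int, N: int, delta: float, g_star: float):
    """The three terms subtracted from 1 in the final bound on P[mu(A ∩ W)]."""
    a, b, _ = theorem_6_12_setup(d, N, delta)
    p = min(1.0, math.exp(a * (g_star - d) - b))  # p(G* - d)
    return 2 * a * d, d * N * math.exp(a * d - b), (1 - p) ** d


if __name__ == "__main__":
    d, N, delta = 5, 10, 0.1
    a, b, threshold = theorem_6_12_setup(d, N, delta)
    print(a, b, threshold)
    print(failure_terms(d, N, delta, threshold + 1), "each should be <=", delta / 3)
```

Note that the threshold grows like $(d/\delta)\ln(N/\delta)$ up to a doubly logarithmic factor, so it is large even for moderate $\delta$.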
We now state and prove the lower bound. 


Theorem 6.13. Let $N \ge 2$, $d \ge 1$, and $0 < \delta < (\ln N)/(\ln N + 1)$. For any selection strategy there exists a gain assignment with

$$G^* = \left\lfloor \frac{1 - \delta}{\delta}\ln N \right\rfloor d$$

such that, with probability at least $1 - \delta$, the gain of the strategy is not more than $d$.


Proof. As in the proof of Theorem 6.11 in Section 6.9, we prove the lower bound with
respect to a random gain assignment and a possibly deterministic selection strategy. Then
we use Fubini’s theorem to turn this into a lower bound with respect to a deterministic
gain assignment and a randomized selection strategy. We begin by specifying a probability
distribution over gain assignments. Each action $i$ satisfies $x_{i,t} = 1$ for $t = 1,\dots,T_i - 1$
and $x_{i,t} = 0$ for all $t \ge T_i$, where $T_1,\dots,T_N$ are random variables specified as follows. Say
that an action $i$ is alive at time $t$ if $t < T_i$ and dead otherwise. Immediately before each
of the time steps $t = 1, 1 + d, 1 + 2d, \dots$ a random subset of the currently alive actions is
chosen and all actions in this subset become dead. The subset is chosen so that for each $k = 1, 2, \dots$ there are exactly $N_k = \lceil N^{1-k/m} \rceil$ actions alive at time steps $1 + (k-1)d,\dots,kd$,
where the integer $m$ will be determined by the analysis. Eventually, during time steps
$1 + (m-1)d,\dots,md$ only one action remains alive (as we have $N_m = 1$), and from
$t = 1 + md$ onward all actions are dead. Note that $G^* = md$ with probability 1.

It is easy to see that the strategy maximizing the probability, with respect to the random
generation of gains, of choosing an action that survives for more than $d$ time steps should
select (it does not matter whether deterministically or probabilistically) an action among
those still alive in the interval $1 + (m-2)d,\dots,(m-1)d$ when $N_{m-1} = \lceil N^{1/m} \rceil$ actions
are still alive. Set

$$m = \left\lfloor \frac{1 - \delta}{\delta}\ln N \right\rfloor.$$


The assumption $\delta < (\ln N)/(\ln N + 1)$ guarantees that $m \ge 1$ (if $m = 1$, then the optimal
strategy chooses a random action at the very beginning of the game). Because $N_m = 1$, the
probability of picking the single action that survives after time $(m-1)d$ is at most

$$\begin{aligned}
N^{-1/m} &\le N^{-\delta/((1-\delta)\ln N)} \\
&= e^{-\delta/(1-\delta)} \\
&\le e^{\ln(1-\delta)} \quad \text{(using $\ln(1 + x) \le x$ for all $x > -1$)} \\
&= 1 - \delta.
\end{aligned}$$
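The survivor schedule and the probability bound above can be tabulated for concrete values. The following sketch computes $m$ and the counts $N_k = \lceil N^{1-k/m} \rceil$, and the quantity $N^{-1/m}$ that is compared with $1 - \delta$. The function names are hypothetical, and a small tolerance is subtracted before taking the ceiling so that floating-point noise does not push an exact integer power up by one.

```python
import math


def survivor_schedule(N: int, delta: float):
    """Return m = floor(((1 - delta)/delta) ln N) and the counts N_1, ..., N_m."""
    m = math.floor((1 - delta) / delta * math.log(N))
    if m < 1:
        raise ValueError("need delta < ln N / (ln N + 1) so that m >= 1")
    counts = [math.ceil(N ** (1 - k / m) - 1e-9) for k in range(1, m + 1)]
    return m, counts


def survivor_pick_bound(N: int, m: int) -> float:
    """Upper bound N^{-1/m} on the probability of picking the last survivor."""
    return N ** (-1 / m)


if __name__ == "__main__":
    N, delta = 100, 0.5
    m, counts = survivor_schedule(N, delta)
    print(m, counts, survivor_pick_bound(N, m), "<=", 1 - delta)
```

The last count is always 1, matching $N_m = 1$, and the next-to-last count $\lceil N^{1/m} \rceil$ is the number of candidates among which any strategy must guess.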


Following the argument in the proof of Theorem 6.11, we view a randomized selection strategy as a probability distribution over deterministic strategies and then apply
Fubini’s theorem. Formally, for any randomized selection strategy achieving a total gain
of $G$,

$$\inf_u \mathbb{E}'\left[\mathbb{I}_{\{G > d\}}\right] \le \mathbb{E}\,\mathbb{E}'\left[\mathbb{I}_{\{G > d\}}\right] = \mathbb{E}'\,\mathbb{E}\left[\mathbb{I}_{\{G > d\}}\right] \le \sup_S \mathbb{E}\left[\mathbb{I}_{\{G > d\}}\right] \le 1 - \delta,$$

where $u$ ranges over all gain assignments, $\mathbb{E}'$ is the expectation taken with respect to
the selection strategy’s internal randomization, $\mathbb{E}$ is the expectation taken with respect to
the random choice of the gain assignment, and $S$ ranges over all deterministic selection
strategies.


6.11 Bibliographic Remarks 


The problem of label efficient prediction was introduced by Helmbold and Panizza [155] 
in the restricted setup when losses are binary valued and there exists an action with zero 
cumulative loss. The material of Sections 6.2 and 6.3 is due to Cesa-Bianchi, Lugosi, and 
Stoltz [55]. However, the proof of Theorem 6.4 is substantially different from the proof 
appearing in [55] and borrows ideas from analogous results from statistical learning theory 
(see Devroye and Lugosi [89] and also Devroye, Györfi, and Lugosi [88]). 

The notion of partial monitoring originates in game theory and was considered, 
among others, by Mertens, Sorin, and Zamir [216], Rustichini [251], and Mannor and
Shimkin [208]. Weissman and Merhav [308] and Weissman, Merhav, and Somekh-Baruch [309] consider various prediction problems in which the forecaster only observes a
noisy version of the true outcomes. These may be considered as special partial monitoring 
problems with random feedback (see Exercises 6.11 and 6.12). 

Piccolboni and Schindelhauer [234] rediscovered partial monitoring as a sequential 
prediction problem. Later, Cesa-Bianchi, Lugosi, and Stoltz [56] extended the results 
in [234] and addressed the problem of optimal rates. See also Auer and Long [14] for 
an analysis of some special cases of partial monitoring in prediction problems. 

The forecaster strategy studied in Section 6.5 was defined by Piccolboni and Schindel- 
hauer [234], who showed that its expected regret has a sublinear growth. The optimal rate of 
convergence and pointwise behavior of this strategy, stated in Theorems 6.5 and 6.6, were 
established in [56]. Piccolboni and Schindelhauer [234] describe sufficient and necessary 
conditions under which Hannan consistency is achievable (see also [56]). Rustichini [251]
and Mannor and Shimkin [208] consider a more general setup in which the feedback is 
not necessarily a deterministic function of the outcome and the action chosen by the fore- 
caster, but it may be random with a distribution depending on the action/outcome pair. 
Rustichini establishes a general existence theorem for Hannan-consistent strategies in this 
more general framework, though he does not offer an explicit prediction strategy. Mannor 
and Shimkin also consider cases when Hannan consistency may not be achieved, give a 
partial solution, and point out important difficulties in such cases. 

The apple tasting problem was considered by Helmbold, Littlestone, and Long [154] in 
the special case when one of the actions has zero cumulative loss. 

Multi-armed bandit problems were originally considered in a stochastic setting (see 
Robbins [245] and Lai and Robbins [189]). Several variants of the basic problems have been 
studied, see, for example, Berry and Fristedt [26] and Gittins [129]. The nonstochastic bandit 
problem studied here was first considered by Baños [21] (see also Megiddo [212]). Hannan- 
consistent strategies were constructed by Foster and Vohra [106], Auer, Cesa-Bianchi, 
Freund, and Schapire [12], and Hart and Mas-Colell [145, 147], see also Fudenberg and
Levine [119]. Hannan consistency of the potential-based strategy analyzed in Section 6.7 
was proved by Hart and Mas-Colell [147] in the special case of the quadratic potential. The 
algorithm analyzed by Auer, Cesa-Bianchi, Freund, and Schapire [12] uses the exponential 
potential and is a simple variant of the strategy of Section 6.7. The multi-armed bandit 
strategy of Section 6.8 and the corresponding performance bound is due to Auer et al. 
[12], see also [10] (though the result presented here is an improved version). Theorem 6.11 
is due to Auer et al. [12], though the main change-of-measure idea already appears in 
Lai and Robbins [189]. For related lower bounds we refer to Auer, Cesa-Bianchi, and 
Fischer [11], and Kulkarni and Lugosi [188]. Extensions of the nonstochastic bandits to 
a scenario where the regret is measured with respect to sequences of actions, instead of 
single actions (extending to the bandit framework the results on tracking actions described 
in Chapter 5), have been considered in [12]. 

Our formulation and analysis of the problem of selecting the best action is based on 
the paper by Awerbuch, Azar, Fiat, and Leighton [18]. Considering a natural extension of 
the setup described here, in the same paper Awerbuch et al. show that with probability 
$1 - O(1/N)$ one can achieve a total gain of $d \ln N$ on any gain assignment satisfying
$G^* \ge C d \ln N$ whenever the selection strategy is allowed to change the selected action at
least $\log_2 N$ times for all (sufficiently large) constants $C$. By slightly increasing the number
of allowed changes, one can also drop the requirement (implicit in Theorem 6.12) that an
estimate of D* is preliminarily available to the selection strategy. 


6.12 Exercises


6.1 Consider the “zero-error” problem described at the beginning of Section 2.4, that is, when
$\mathcal{Y} = \mathcal{D} = \{0, 1\}$, $\ell(\widehat{p}, y) = |\widehat{p} - y| \in \{0, 1\}$ and $\min_{i=1,\dots,N} L_{i,n} = 0$. Consider a label efficient
version of the “halving” algorithm described in Section 2.4, which asks for a label randomly
similarly to the strategy of Section 6.2. Show that if the expected number of labels asked by
the forecaster is $m$, the expected number of mistakes it makes is bounded by $(n/m)\log_2 N$
(Helmbold and Panizza [155]). Derive a corresponding mistake bound that holds with large
probability.

6.2 (Switching actions not too often) Prove that in the oblivious opponent model, the lazy label
efficient forecaster achieves the regret bound of Theorem 6.1 with the additional feature that
with probability 1, the number of changes of an action (i.e., the number of steps where $I_t \ne I_{t+1}$)
is at most the number of queried labels.


6.3 (Continued) Strengthen Theorem 6.2 in the same way.


6.4 Let $m < n$. Show that there exists a label efficient forecaster that, with probability at least $1 - \delta$,
reveals at most $m$ labels and has an excess loss bounded by

$$c\left(\sqrt{\frac{nL^*\ln(4N/\delta)}{m}} + \frac{n}{m}\ln\frac{4N}{\delta}\right),$$

where $L^* = \min_{i=1,\dots,N} L_{i,n}$ and $c$ is a constant. Note that this inequality refines Theorem 6.2
in the spirit of Corollary 2.4 and, in the special case of $L^* = 0$, matches the order of magnitude
of the bound of Exercise 6.1 (Cesa-Bianchi, Lugosi, and Stoltz [55]).


6.5 Complete the details of Hannan consistency under the assumptions of Theorem 6.6. The prediction algorithm of the theorem assumes that the total number $n$ of rounds is known in advance,
because both $\eta$ and $\gamma$ depend on $n$. Show that this assumption may be dropped and construct a
Hannan-consistent procedure in which $\eta$ and $\gamma$ only depend on $t$. (Cesa-Bianchi, Lugosi, and
Stoltz [56].)


6.6 Prove Theorem 6.7.


6.7 (Fast rates in partial monitoring problems) Consider the version of the dynamic pricing
problem described in Section 6.4 in which the feedback matrix $H$ is as before but the losses are
given by $\ell(i, j) = |i - j|/N$. Construct a prediction strategy for which, with large probability,


Hint: Observe that the feedback reveals the value of the “derivative” of the loss and use an 
algorithm based on the gradient of the loss, as described in Section 2.5. 


6.8 (Consistency in partial monitoring) Consider the prediction problem with partial monitoring in
the special case when the predictor has $N = 2$ actions. (Note that the number $M$ of outcomes may
be arbitrary.) Assume that the $2 \times M$ loss matrix $L$ is such that none of the two actions dominates
the other in the sense that it is not true that for some $i = 1, 2$ and $i' \ne i$, $\ell(i, j) \le \ell(i', j)$ for
all $j = 1,\dots,M$. (If one of the actions dominates the other, Hannan consistency is trivial to
achieve by playing always the dominating action.) Show that there exists a Hannan-consistent
procedure if and only if the feedback is such that a feedback matrix $H$ can be constructed such
that $L = KH$ for some $2 \times 2$ matrix $K$. Hint: If $L = KH$, then Theorem 6.6 guarantees Hannan
consistency. It suffices to show that if for all possible encodings of the feedback by real numbers
the rank of $H$ is 1, then Hannan consistency cannot be achieved.


6.9 (Continued) Consider now the case when $M = 2$, that is, the outcomes are binary but $N$ may
be arbitrary. Assume again that there is no dominating action. Show that there exists a Hannan-consistent
procedure if and only if for some version of $H$ we may write $L = KH$ for some
N x N matrix K. Hint: If the rank of L is less than 2, then there is a dominating action, so that 
we may assume that the rank of L is 2. Next show that if for all versions of H, the rank of H is 
at most 1, then all rows of H are constants and therefore there is no useful feedback. 


6.10 Consider the partial monitoring prediction problem. Assume that $M = N$ and the loss matrix
$L$ is the identity matrix. Show that if some columns of the feedback matrix $H$ are identical then Hannan consistency is impossible to achieve. Partition the set of possible outcomes
$\{1,\dots,M\}$ into $k$ sets $S_1,\dots,S_k$ such that in each set the columns of the feedback matrix
$H$ corresponding to the outcomes in the set are identical. Denote the cardinality of these sets
by $M_1 = |S_1|,\dots,M_k = |S_k|$. (Thus, $M_1 + \cdots + M_k = M$.) Form an $N \times k$ matrix $H'$ whose
columns are the different columns of the feedback matrix $H$ such that the $j$th column of $H'$ is
the column of $H$ corresponding to the outcomes in the set $S_j$, $j = 1,\dots,k$. Let $L'$ be the $N \times k$
matrix whose $j$th column is the average of the $M_j$ columns of $L$ corresponding to the outcomes
in $S_j$, $j = 1,\dots,k$. Show that if there exists an $N \times N$ matrix $K$ with $L' = KH'$, then there is
a randomized strategy such that


n k Mj 
1 l 1 ' 
B 2 mesnr, 2 t6 m) = 0(1), 


Hoa t=1 j=1 m=1 


with probability 1. (See Rustichini [251] for a more general result.) 


6.11 (Partial monitoring with random feedback) Consider an extension of the prediction model
under partial monitoring described in Section 6.4 such that the feedback may not be a simple
function of the action and the outcome but rather a random variable. More precisely, denote by
$\Delta(S)$ the set of all probability distributions over the set of signals $S$. The signaling structure
is formed by a collection of $NM$ probability distributions $\mu_{(i,j)}$ over $S$ for $i = 1,\dots,N$ and
$j = 1,\dots,M$. At each round, the forecaster now observes a random variable $H(I_t, y_t)$, drawn
independently from all the other random variables, with distribution $\mu_{(I_t, y_t)}$.

Denote by $E_{(i,j)}$ the expectation of $\mu_{(i,j)}$ and by $E$ the $N \times M$ matrix formed by these
elements. Show that if there exists a matrix $K$ such that $L = KE$, then a Hannan-consistent
forecaster may be constructed. Hint: Consider the modification of the forecaster of Section 6.5
defined by the estimated losses

$$\widetilde{\ell}(i, y_t) = \frac{k(i, I_t)\,H(I_t, y_t)}{p_{I_t,t}}, \qquad i = 1,\dots,N.$$

6.12 (Noisy observation) Consider the forecasting problem in which the outcome sequence is binary,
that is, $\mathcal{Y} = \{0, 1\}$, expert $i$ predicts according to the real number $f_{i,t} \in [0, 1]$ ($i = 1,\dots,N$),
and the loss of expert $i$ is measured by the absolute loss $\ell(f_{i,t}, y_t) = |f_{i,t} - y_t|$ (as in Chapter 8).
Suppose that instead of observing the true outcomes $y_t$, the forecaster only has access to a “noisy”
version $y_t \oplus B_t$, where $\oplus$ denotes the XOR operation (i.e., addition modulo 2) and $B_1,\dots,B_n$
are i.i.d. Bernoulli random variables with unknown parameter $0 < p < 1/2$. Show that the
simple exponentially weighted average forecaster achieves an expected regret


(See Weissman and Merhav [308] and Weissman, Merhav, and Somekh-Baruch [309] for various 
versions of the problem of noisy observations.)

6.13 (Label efficient partial monitoring) Consider the label efficient version of the partial monitoring problem in which, during the $n$ rounds of the game, the forecaster can ask for feedback
at most $m$ times at periods of his choice. Assume that the loss and feedback matrices satisfy
$L = KH$ for some $N \times N$ matrix $K = [k_{i,j}]$, with $k^* = \max\{1, \max_{i,j} |k(i, j)|\}$. Construct a
forecasting strategy whose expected regret satisfies

$$\mathbb{E}\,\widehat{L}_n - \min_{i=1,\dots,N} \mathbb{E}\,L_{i,n} \le c\,n\,\frac{(Nk^*)^{2/3}(\ln N)^{1/3}}{m^{1/3}},$$

where $c$ is a constant. Derive a similar upper bound for the regret that holds with high probability.


6.14 (Internal regret minimization under partial monitoring) Consider a partial monitoring problem in which the loss and feedback matrices satisfy $L = KH$ for some $N \times N$ matrix $K = [k_{i,j}]$,
with $k^* = \max\{1, \max_{i,j} |k(i, j)|\}$. Construct a forecasting strategy such that, with probability
at least $1 - \delta$, the internal regret


$$\max_{i,j} \sum_{t=1}^n p_{i,t}\left(\ell(i, Y_t) - \ell(j, Y_t)\right)$$


is bounded by a constant times $\left((k^*)^2 N^5 \ln(N/\delta)\,n^2\right)^{1/3}$ (Cesa-Bianchi, Lugosi, and Stoltz [56]).
Hint: Use the conversion described at the end of Section 4.4 and proceed similarly as in
Section 6.5.

6.15 (Internal regret minimization in the bandit setting) Construct a forecasting strategy that
achieves, in the setting of the bandit problem, an internal regret of order $\sqrt{nN\ln(N/\delta)}$. Hint:
Combine the conversion used in the previous exercise with the techniques of Section 6.8.


6.16 (Compound actions and partial monitoring) Let $S$ be a set of compound actions (i.e.,
sequences $\mathbf{i} = (i_1,\dots,i_n)$ of actions $i_t \in \{1,\dots,N\}$) of cardinality $|S| = M$. Consider a partial
monitoring problem in which the loss and feedback matrices satisfy $L = KH$ for some $N \times N$
matrix $K$. Construct a forecaster strategy whose expected regret (with respect to the set $S$) satisfies


i = . 2/3 
ED ACs Y,)- mak Ds Y) <3(kVe—2) (a)*8anMy", 


(Note that the dependence on the number of compound actions is only logarithmic!) Derive a 
corresponding bound that holds with high probability. Hint: Consider the forecaster that assigns 
a weight to every compound actioni = (i1, ...,i,) E€ S updated by wi, = wipe ee (where 
T is the estimated loss introduced in Section 6.5) and draws action 7 with probability 


Pinzi Wit 4 Y 


Pir =- y) 


Dies Wit N 
(The stochastic multi-armed bandit problem) Consider the multi-armed bandit problem under
the assumption that for each $i \in \{1, \dots, N\}$, the losses $\ell(i, y_1), \ell(i, y_2), \dots$ form an
independent, identically distributed sequence of random variables such that $\mathbb{E}\,|\ell(i, y_1)| < \infty$ for all $i$.
Construct a nonrandomized prediction scheme which guarantees that

\[
\frac{1}{n} \sum_{t=1}^{n} \ell(I_t, y_t) - \min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} \ell(i, y_t) \to 0,
\]

with probability 1.

(Continued) Consider the setup of the previous example under the additional assumption that
the losses $\ell(i, y_t)$ take their values in the interval $[0, 1]$. Consider the prediction rule that, at
time $t$, chooses an action $i \in \{1, \dots, N\}$ by minimizing


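\[
\frac{\sum_{s=1}^{t-1} \mathbb{I}_{\{I_s = i\}}\, \ell(i, y_s)}{\sum_{s=1}^{t-1} \mathbb{I}_{\{I_s = i\}}}
- \sqrt{\frac{2 \ln t}{\sum_{s=1}^{t-1} \mathbb{I}_{\{I_s = i\}}}}.
\]

Prove that the expected number of times any action $i$ with

\[
\mathbb{E}\,\ell(i, y_1) > \min_{j=1,\dots,N} \mathbb{E}\,\ell(j, y_1)
\]

is played is bounded by a constant times $\ln n$ (Auer, Cesa-Bianchi, and Fischer [11]; see also
Lai and Robbins [189] for related results).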


probability 1. This follows from the fact that for any fixed i, 


n 


1 k = a Rin Rin 1 ~ 
l (Em5 eiw) = Ae ho Ds ea) 


t=1 


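modification of Theorem 2.1. The additional difficulty is that the Blackwell condition is not
satisfied and the first-order term cannot be ignored. Use Taylor’s theorem to bound $\Phi(\tilde{\mathbf{R}}_t)$ in
terms of $\Phi(\tilde{\mathbf{R}}_{t-1})$ to get

\[
\begin{aligned}
\Phi(\tilde{\mathbf{R}}_n)
\le{}& \Phi(\mathbf{0})
 + \sum_{t=1}^{n} \mathbb{E}\bigl[ \nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t \,\big|\, I_1, \dots, I_{t-1} \bigr]
 + \frac{1}{2} \sum_{t=1}^{n} C(\tilde{\mathbf{r}}_t) \\
& + \sum_{t=1}^{n} \Bigl( \nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t
 - \mathbb{E}\bigl[ \nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t \,\big|\, I_1, \dots, I_{t-1} \bigr] \Bigr).
\end{aligned}
\tag{6.8}
\]

Because $\max_i \tilde{R}_{i,n} \le \phi^{-1}\bigl( \psi^{-1}( \Phi(\tilde{\mathbf{R}}_n) ) \bigr)$, it suffices to show that the last three terms on the
right-hand side are of smaller order than $\psi(\phi(n))$, almost surely.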
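To bound the first of the three terms, use the unbiasedness of the estimator $\tilde{\mathbf{r}}_t$ to show

\[
\mathbb{E}\bigl[ \nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t \,\big|\, I_1, \dots, I_{t-1} \bigr]
\le \frac{\gamma_t}{N} \sum_{i=1}^{N} \nabla_i \Phi(\tilde{\mathbf{R}}_{t-1}).
\]

Because the components of the regret vector satisfy $\tilde{R}_{i,t-1} \le t - 1$, assumption (iii) guarantees
that

\[
\sum_{t=1}^{n} \mathbb{E}\bigl[ \nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t \,\big|\, I_1, \dots, I_{t-1} \bigr] = o\bigl( \psi(\phi(n)) \bigr).
\]
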
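To bound the last term on the right-hand side of (6.8), observe that

\[
\nabla\Phi(\tilde{\mathbf{R}}_{t-1}) \cdot \tilde{\mathbf{r}}_t
\le \max_{j} |\tilde{r}_{j,t}| \sum_{j=1}^{N} \nabla_j \Phi(\tilde{\mathbf{R}}_{t-1})
\le \frac{N}{\gamma_t} \sum_{j=1}^{N} \nabla_j \Phi(\tilde{\mathbf{R}}_{t-1})
\]

and use the Hoeffding–Azuma inequality.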


Consider the potential-based strategy for the multi-armed bandit problem of Section 6.7 with
the time-varying exponential potential

\[
\Phi_t(\mathbf{u}) = \frac{1}{\eta_t} \ln \left( \sum_{i=1}^{N} e^{\eta_t u_i} \right),
\]

where $\eta_t > 0$ may depend on $t$. Combine the proof of Theorem 6.9 with the techniques of
Section 2.3 to determine values of $\eta_t$ and $\gamma_t$ that lead to a Hannan-consistent strategy.


6.21 


6.22 




(The tracking problem in the bandit setting) Consider the problem of tracking the best expert
studied in Section 5.2. Extend Theorem 5.2 by bounding the expected regret

\[
\mathbb{E} \left[ \sum_{t=1}^{n} \ell(I_t, Y_t) - \sum_{t=1}^{n} \ell(i_t, Y_t) \right]
\]

for any action sequence $i_1, \dots, i_n$ under the bandit assumption: after making each prediction, the
forecaster learns his own loss $\ell(I_t, Y_t)$ but not the value of the outcome $Y_t$ (Auer, Cesa-Bianchi,
Freund, and Schapire [12]). Hint: To bound the expected regret, it is enough to consider the
forecaster of Section 6.7, drawing action $i$ at time $t$ with probability

\[
p_{i,t} = (1 - \gamma)\, \frac{e^{-\eta \tilde{L}_{i,t-1}}}{\sum_{k=1}^{N} e^{-\eta \tilde{L}_{k,t-1}}} + \frac{\gamma}{N},
\]

where $\tilde{L}_{i,t} = \sum_{s=1}^{t} \tilde{\ell}(i, Y_s)$ and

\[
\tilde{\ell}(i, Y_t) =
\begin{cases}
\ell(i, Y_t)/p_{i,t} & \text{if } I_t = i \\
0 & \text{otherwise.}
\end{cases}
\]

To get the right dependence on the horizon $n$ in the bound, analyze this forecaster using parts
of the proof of Theorem 6.10 combined with the proof of Theorem 5.2.


In the setup described in Section 6.10, at each time step $t$ before the time step where an action is
drawn, the selection strategy observes the gains $x_{1,t}, \dots, x_{N,t}$ of all the actions. Prove a “bandit”
variant of Theorem 6.12 where the selection strategy may only observe the gain $x_{k,t}$ of a single
action $k$, where the index $k$ of the action to observe is specified by the strategy.


Prediction and Playing Games


7.1 Games and Equilibria 


The prediction problems studied in previous chapters have often been represented as
repeated games between a forecaster and the environment. Our use of a game-theoretic
formalism is not accidental: there exists an intimate connection between sequential pre- 
diction and some fundamental problems belonging to the theory of learning in games. We 
devote this chapter to the exploration of some of these connections. 

Rather than giving an exhaustive account of the area of learning in games, we only focus 
on “regret-based” learning procedures (i.e., situations in which the players of the game base 
their strategies only on regrets they have suffered in the past) and our fundamental concern 
is whether such procedures lead to equilibria. We also limit our attention to finite strategic 
or normal form games. 

In this introductory section we present the basic definitions of the games we consider, 
describe some notions of equilibria, and introduce the model of playing repeated games 
that we investigate in the subsequent sections of this chapter. 


K-Person Normal Form Games

A (finite) K-person game given in its strategic (or normal) form is defined as follows.
Player $k$ ($k = 1, \dots, K$) has $N_k$ possible actions (or pure strategies) to choose from, where
$N_k$ is a positive integer. If the action of each player $k = 1, \dots, K$ is $i_k \in \{1, \dots, N_k\}$ and
we denote the K-tuple of all the players’ actions by $\mathbf{i} = (i_1, \dots, i_K) \in \bigotimes_{k=1}^{K} \{1, \dots, N_k\}$,
then the loss suffered by player $k$ is $\ell^{(k)}(\mathbf{i})$, where $\ell^{(k)} : \bigotimes_{k=1}^{K} \{1, \dots, N_k\} \to [0, 1]$ for each
$k = 1, \dots, K$ are given loss functions for all players. Note that, slightly deviating from
the usual game-theoretic terminology, we consider losses as opposed to the more standard
payoffs. The reader should keep in mind that the goal of each player is to minimize his loss,
which is the same as maximizing payoffs if one defines payoffs as negative losses. We use
this convention to harmonize notation with the rest of the book.

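A mixed strategy for player $k$ is a probability distribution $\mathbf{p}^{(k)} = (p^{(k)}_1, \dots, p^{(k)}_{N_k})$ over
the set $\{1, \dots, N_k\}$ of actions. When mixed strategies are used, players randomize, that is,
choose an action according to the distribution specified by the mixed strategy. Denote the
action played by player $k$ by $I^{(k)}$. Thus, $I^{(k)}$ is a random variable taking values in the set
$\{1, \dots, N_k\}$ and distributed according to $\mathbf{p}^{(k)}$. Let $\mathbf{I} = (I^{(1)}, \dots, I^{(K)})$ denote the K-tuple of
actions played by all players. If the random variables $I^{(1)}, \dots, I^{(K)}$ are independent (i.e., the
players randomize independently of each other), we denote their joint distribution by $\pi$. That
is, $\pi$ is the joint distribution over the set $\bigotimes_{k=1}^{K} \{1, \dots, N_k\}$ of all possible K-tuples of actions
obtained by the product of the mixed strategies $\mathbf{p}^{(1)}, \dots, \mathbf{p}^{(K)}$. The product distribution $\pi$
is called a mixed strategy profile. Thus, for all $\mathbf{i} = (i_1, \dots, i_K) \in \bigotimes_{k=1}^{K} \{1, \dots, N_k\}$,

\[
\pi(\mathbf{i}) = \mathbb{P}[\mathbf{I} = \mathbf{i}] = p^{(1)}_{i_1} \times \cdots \times p^{(K)}_{i_K}.
\]

The expected loss of player $k$ is

\[
\begin{aligned}
\pi \ell^{(k)} &\stackrel{\mathrm{def}}{=} \mathbb{E}\,\ell^{(k)}(\mathbf{I}) \\
&= \sum_{\mathbf{i} \in \bigotimes_{k=1}^{K} \{1, \dots, N_k\}} \pi(\mathbf{i})\, \ell^{(k)}(\mathbf{i}) \\
&= \sum_{i_1=1}^{N_1} \cdots \sum_{i_K=1}^{N_K} p^{(1)}_{i_1} \times \cdots \times p^{(K)}_{i_K}\, \ell^{(k)}(i_1, \dots, i_K).
\end{aligned}
\]
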
Nash Equilibrium 
Perhaps the most important notion of game theory is that of a Nash equilibrium. A mixed
strategy profile $\pi = \mathbf{p}^{(1)} \times \cdots \times \mathbf{p}^{(K)}$ is called a Nash equilibrium if for all $k = 1, \dots, K$
and all mixed strategies $\mathbf{q}^{(k)}$, if $\pi'_k = \mathbf{p}^{(1)} \times \cdots \times \mathbf{q}^{(k)} \times \cdots \times \mathbf{p}^{(K)}$ denotes the mixed
strategy profile obtained by replacing $\mathbf{p}^{(k)}$ by $\mathbf{q}^{(k)}$ and leaving all other players’ mixed strategies
unchanged, then

\[
\pi \ell^{(k)} \le \pi'_k \ell^{(k)}.
\]

This means that if $\pi$ is a Nash equilibrium, then no player has an incentive to change
his mixed strategy if all other players do not change theirs (i.e., every player is happy).
A celebrated result of Nash [222] shows that every finite game has at least one Nash
equilibrium. (The proof is typically based on fixed-point theorems.) However, a game may
have multiple Nash equilibria, and the set $\mathcal{N}$ of all Nash equilibria can have a quite complex
structure.


Two-Person Zero-Sum Games 

A simple but important special class of games is the class of two-person zero-sum games. 
These games are played by two players (i.e., $K = 2$) and the payoff functions are such that
for each pair of actions $\mathbf{i} = (i_1, i_2)$, where $i_1 \in \{1, \dots, N_1\}$ and $i_2 \in \{1, \dots, N_2\}$, the losses
of the two players satisfy

\[
\ell^{(1)}(\mathbf{i}) = -\ell^{(2)}(\mathbf{i}).
\]


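Thus, in such games the objective of the second player (often called column player) is to
maximize the loss of the first player (the row player). To simplify notation we will just write
$\ell$ for $\ell^{(1)}$, replace $N_1, N_2$ by $N$ and $M$, and write $(i, j)$ instead of $(i_1, i_2)$. Mixed strategies
of the row and column players will be denoted by $\mathbf{p} = (p_1, \dots, p_N)$ and $\mathbf{q} = (q_1, \dots, q_M)$.
It is immediate to see that the product distribution $\pi = \mathbf{p} \times \mathbf{q}$ is a Nash equilibrium if and
only if for all $\mathbf{p}' = (p'_1, \dots, p'_N)$ and $\mathbf{q}' = (q'_1, \dots, q'_M)$,

\[
\sum_{i=1}^{N} \sum_{j=1}^{M} p_i q'_j\, \ell(i, j)
\le \sum_{i=1}^{N} \sum_{j=1}^{M} p_i q_j\, \ell(i, j)
\le \sum_{i=1}^{N} \sum_{j=1}^{M} p'_i q_j\, \ell(i, j).
\]

Introducing the simplifying notation

\[
\bar{\ell}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_i q_j\, \ell(i, j),
\]

the above is equivalent to

\[
\max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}, \mathbf{q}') = \bar{\ell}(\mathbf{p}, \mathbf{q}) = \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}).
\]

This obviously implies that

\[
\max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}, \mathbf{q}') \le \max_{\mathbf{q}'} \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}'),
\]

and therefore the existence of a Nash equilibrium $\mathbf{p} \times \mathbf{q}$ implies that

\[
\min_{\mathbf{p}'} \max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}', \mathbf{q}') \le \max_{\mathbf{q}'} \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}').
\]

On the other hand, clearly, for all $\mathbf{p}$ and $\mathbf{q}'$, $\bar{\ell}(\mathbf{p}, \mathbf{q}') \ge \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}')$ and therefore for all
$\mathbf{p}$, $\max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}, \mathbf{q}') \ge \max_{\mathbf{q}'} \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}')$, and in particular,

\[
\min_{\mathbf{p}'} \max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}', \mathbf{q}') \ge \max_{\mathbf{q}'} \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}').
\]

In summary, the existence of a Nash equilibrium implies that

\[
\min_{\mathbf{p}'} \max_{\mathbf{q}'} \bar{\ell}(\mathbf{p}', \mathbf{q}') = \max_{\mathbf{q}'} \min_{\mathbf{p}'} \bar{\ell}(\mathbf{p}', \mathbf{q}').
\]
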
The common value of the left-hand and right-hand sides is called the value of the game 
and will be denoted by $V$. This equation, known as von Neumann’s minimax theorem, is
one of the fundamental results of game theory. Here we derived it as a consequence of 
the existence of Nash equilibria (which, in turn, is based on fixed-point theorems), but 
significantly simpler proofs may be given. In Section 7.2 we offer an elementary “learning- 
theoretic” proof, based on the basic techniques introduced in Chapter 2, of a powerful 
minimax theorem that, in turn, implies von Neumann’s minimax theorem for two-person 
zero-sum games. 


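It is also clear from the argument that any Nash equilibrium $\mathbf{p} \times \mathbf{q}$ achieves the value of
the game in the sense that

\[
\bar{\ell}(\mathbf{p}, \mathbf{q}) = V
\]

and that any product distribution $\mathbf{p} \times \mathbf{q}$ with $\bar{\ell}(\mathbf{p}, \mathbf{q}) = V$ is a Nash equilibrium.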
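
The saddle-point condition derived above can be checked numerically for a candidate pair of mixed strategies. The following is a minimal sketch, assuming the loss matrix is stored as a list of rows with entries in [0, 1]; the function names `mixed_loss` and `is_zero_sum_nash` are illustrative and do not come from the text. Because the expected loss is linear in each argument, it is enough to compare against pure deviations.

```python
def mixed_loss(loss, p, q):
    """Expected loss of the row player when the row plays p and the column plays q."""
    return sum(p[i] * q[j] * loss[i][j]
               for i in range(len(p)) for j in range(len(q)))


def is_zero_sum_nash(loss, p, q, tol=1e-9):
    """Check max_{q'} l(p, q') = l(p, q) = min_{p'} l(p', q) using pure deviations only."""
    n_rows, n_cols = len(loss), len(loss[0])
    value = mixed_loss(loss, p, q)
    # Best the column player (who maximizes the row player's loss) can do against p.
    best_for_column = max(sum(p[i] * loss[i][j] for i in range(n_rows))
                          for j in range(n_cols))
    # Best the row player (who minimizes his loss) can do against q.
    best_for_row = min(sum(q[j] * loss[i][j] for j in range(n_cols))
                       for i in range(n_rows))
    return abs(best_for_column - value) <= tol and abs(best_for_row - value) <= tol


# Matching pennies, written as losses of the row player (hypothetical example).
pennies = [[1.0, 0.0], [0.0, 1.0]]
uniform_is_nash = is_zero_sum_nash(pennies, [0.5, 0.5], [0.5, 0.5])
pure_row_is_nash = is_zero_sum_nash(pennies, [1.0, 0.0], [0.5, 0.5])
```

In this example the uniform pair passes the check and has expected loss 1/2, while a pure row strategy fails it because the column player can raise the row player's loss by deviating.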


Correlated Equilibrium 

An important generalization of the notion of Nash equilibrium, introduced by Aumann [16],
is the notion of correlated equilibrium. A probability distribution $P$ over the set
$\bigotimes_{k=1}^{K} \{1, \dots, N_k\}$ of all possible K-tuples of actions is called a correlated equilibrium
if for all $k = 1, \dots, K$,

\[
\mathbb{E}\,\ell^{(k)}(\mathbf{I}) \le \mathbb{E}\,\ell^{(k)}(\mathbf{I}^-, \tilde{I}^{(k)}),
\]

where the random variable $\mathbf{I} = (I^{(1)}, \dots, I^{(K)})$ is distributed according to $P$ and
$(\mathbf{I}^-, \tilde{I}^{(k)}) = (I^{(1)}, \dots, I^{(k-1)}, \tilde{I}^{(k)}, I^{(k+1)}, \dots, I^{(K)})$, where $\tilde{I}^{(k)}$ is an arbitrary
$\{1, \dots, N_k\}$-valued random variable that is a function of $I^{(k)}$.

The distinguishing feature of the notion is that, unlike in the definition of Nash equilibria,
the random variables $I^{(k)}$ do not need to be independent, or, in other words, $P$ is not
necessarily a product distribution (hence the name “correlated”). Indeed, if $P$ is a product
measure, a correlated equilibrium becomes a Nash equilibrium. The existence of a Nash
equilibrium of any game thus assures that correlated equilibria always exist.

A correlated equilibrium may be interpreted as follows. Before taking an action, each
player receives a recommendation $I^{(k)}$ such that the $I^{(k)}$ are drawn randomly according to
the joint distribution $P$. The defining inequality expresses that, in an average sense, no
player has an incentive to deviate from the recommendation, provided that all other players
follow theirs. Correlated equilibria model solutions of games in which the actions of players
may be influenced by external signals.

A simple equivalent description of a correlated equilibrium is given by the following 
lemma whose proof is left as an exercise. 


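Lemma 7.1. A probability distribution $P$ over the set of all K-tuples $\mathbf{i} = (i_1, \dots, i_K)$ of
actions is a correlated equilibrium if and only if, for every player $k \in \{1, \dots, K\}$ and
actions $j, j' \in \{1, \dots, N_k\}$, we have

\[
\sum_{\mathbf{i} \,:\, i_k = j} P(\mathbf{i}) \bigl( \ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i}^-, j') \bigr) \le 0,
\]

where $(\mathbf{i}^-, j') = (i_1, \dots, i_{k-1}, j', i_{k+1}, \dots, i_K)$.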


Lemma 7.1 reveals that the set of all correlated equilibria is given by an intersection of 
closed halfspaces and therefore it is a closed and convex polyhedron. For example, any 
convex combination of Nash equilibria is a correlated equilibrium. (This can easily be seen 
by observing that the inequalities of Lemma 7.1 trivially hold if one takes P as any weighted 
mixture of Nash equilibria.) However, there may exist correlated equilibria outside of the 
convex hull of Nash equilibria (see Exercise 7.4). In general, the structure of the set of 
correlated equilibria is much simpler than that of Nash equilibria. In fact, the existence 
of correlated equilibria may be proven directly and without having to resort to fixed point 
theorems. We do this implicitly in Section 7.4. Also, as we show in Section 7.4, given a 
game, it is computationally very easy to find a correlated equilibrium. Computing a Nash 
equilibrium, on the other hand, appears to be a significantly harder problem. 
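
Since Lemma 7.1 describes correlated equilibria by finitely many linear inequalities, membership can be tested directly. The sketch below assumes a hypothetical representation that is not from the text: a joint distribution `P` as a dictionary from action tuples (0-based) to probabilities, and `losses[k]` as a dictionary giving player k's loss for every action tuple. The example game is a rescaled version of the game of chicken, chosen only for illustration.

```python
from itertools import product


def is_correlated_equilibrium(P, losses, n_actions, tol=1e-9):
    """Check the inequalities of Lemma 7.1 for every player k and pair (j, j')."""
    profiles = list(product(*(range(n) for n in n_actions)))
    for k in range(len(n_actions)):
        for j in range(n_actions[k]):
            for j_alt in range(n_actions[k]):
                total = 0.0
                for i in profiles:
                    if i[k] != j:
                        continue
                    # Profile in which player k plays j' instead of the recommended j.
                    deviated = i[:k] + (j_alt,) + i[k + 1:]
                    total += P.get(i, 0.0) * (losses[k][i] - losses[k][deviated])
                if total > tol:
                    return False
    return True


# Game of chicken with losses in [0, 1]; action 0 = "dare", action 1 = "swerve".
chicken_losses = [
    {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 5 / 7, (1, 1): 1 / 7},  # player 0
    {(0, 0): 1.0, (0, 1): 5 / 7, (1, 0): 0.0, (1, 1): 1 / 7},  # player 1
]
# Equal weight on three profiles: not a product distribution.
signal = {(0, 1): 1 / 3, (1, 0): 1 / 3, (1, 1): 1 / 3}
both_dare = {(0, 0): 1.0}

signal_is_ce = is_correlated_equilibrium(signal, chicken_losses, [2, 2])
both_dare_is_ce = is_correlated_equilibrium(both_dare, chicken_losses, [2, 2])
```

The distribution `signal` satisfies all inequalities although it is not a product measure, whereas the point mass on both players daring does not, since either player would rather swerve.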


Playing Repeated Games 

The most natural application of the ideas of randomized prediction of Chapters 4 and 6 is in
the theory of playing repeated games. In the model we investigate, a K-person game (the
so-called one-shot game) is played repeatedly such that at each time instant $t = 1, 2, \dots$ player
$k$ ($k = 1, \dots, K$) selects a mixed strategy $\mathbf{p}^{(k)}_t = (p^{(k)}_{1,t}, \dots, p^{(k)}_{N_k,t})$ over the set $\{1, \dots, N_k\}$
of his actions and draws an action $I^{(k)}_t$ according to this distribution. We assume that the
randomizations of the players are independent of each other and of past randomizations.
After the actions are taken, player $k$ suffers a loss $\ell^{(k)}(\mathbf{I}_t)$, where $\mathbf{I}_t = (I^{(1)}_t, \dots, I^{(K)}_t)$.
Formally, at time $t$ player $k$ has access to $U^{(k)}_t$, where the $U^{(k)}_t$ are independent random
variables uniformly distributed in $[0, 1]$, and chooses $I^{(k)}_t$ such that

\[
I^{(k)}_t = i \quad \text{if and only if} \quad U^{(k)}_t \in \left[ \sum_{j=1}^{i-1} p^{(k)}_{j,t},\; \sum_{j=1}^{i} p^{(k)}_{j,t} \right)
\]

so that

\[
\mathbb{P}\bigl[ I^{(k)}_t = i \,\big|\, \text{past plays} \bigr] = p^{(k)}_{i,t}, \qquad i = 1, \dots, N_k.
\]
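
The randomization rule above is the usual inverse-distribution-function method. As a minimal sketch (the function name `draw_action` is an assumption made for illustration), an action can be drawn from a mixed strategy given a uniform variable as follows; actions are numbered from 1 as in the text.

```python
import random


def draw_action(p, u):
    """Return the action i (numbered from 1) whose cumulative interval
    [p_1 + ... + p_{i-1}, p_1 + ... + p_i) contains u."""
    cumulative = 0.0
    for i, prob in enumerate(p, start=1):
        cumulative += prob
        if u < cumulative:
            return i
    # Reached only if rounding leaves the cumulative sum slightly below u.
    return len(p)


# One round of play for a player with three actions (illustrative values).
mixed_strategy = [0.2, 0.5, 0.3]
action = draw_action(mixed_strategy, random.random())
```

Drawing a fresh uniform variable in every round, independently across players and rounds, matches the independence assumption made in the model.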


In the basic setup we assume that after taking an action $I^{(k)}_t$ each player observes all
other players’ actions, that is, the whole K-tuple $\mathbf{I}_t = (I^{(1)}_t, \dots, I^{(K)}_t)$. This means that
the mixed strategy $\mathbf{p}^{(k)}_t$ played at time $t$ may depend on the sequence of random variables
$\mathbf{I}_1, \dots, \mathbf{I}_{t-1}$.

However, we will be concerned only with “uncoupled” ways of playing, that is, when 
each player knows his own loss (or payoff) function but not those of the other players. More 
specifically, we only consider regret-based procedures in which the mixed strategy chosen 
by every player depends, in some way, on his past losses. We focus our attention on whether 
such simple procedures can lead to some kind of equilibrium of the (one-shot) game. In 
Section 7.3 we consider the simplest situation, the case of two-person zero-sum games. We 
show that a simple application of the regret-minimization procedures of Section 4.2 leads 
to a solution of the game in a quite robust sense. 

In Section 7.4 it is shown that if all players play to keep their internal regret small,
then the joint empirical frequencies of play converge to the set of correlated equilibria. The
same convergence may also be achieved if every player uses a well-calibrated forecasting
strategy to predict the K-tuple of actions $\mathbf{I}_t$ and chooses an action that is a best reply to the
forecasted distribution. This is shown in Section 7.6. A model for learning in games even
more restrictive than uncoupledness is the model of an “unknown game.” In this model the 
players cannot even observe the actions of the other players; the only information available 
to them is their own loss suffered after each round of the game. In Section 7.5 we point 
out that the forecasting techniques for the bandit problem discussed in Chapter 6 allow 
convergence of the empirical frequencies of play even in this setup of limited information. 

In Sections 7.7 and 7.8 we sketch Blackwell’s approachability theory. Blackwell consid- 
ered two-person zero-sum games with vector-valued losses and proved a powerful gener- 
alization of von Neumann’s minimax theorem, which is deeply connected with the regret- 
minimizing forecasting methods of Chapter 4. 

Sections 7.9 and 7.10 discuss the possibility of reaching a Nash equilibrium in uncoupled 
repeated games. Simple learning dynamics, versions of a method called “regret testing,” are 
introduced that guarantee that the joint plays approach a Nash equilibrium in some sense. 
The case of unknown games is also investigated. 

In Section 7.11 we address an important criticism of the notion of Hannan consistency. 
In fact, the basic notions of regret compare the loss of the forecaster with that of the best 
constant action, but without taking into account that the behavior of the opponent may 
depend on the actions of the forecaster. In Section 7.11 we show that asymptotic regret 
minimization is possible even in the presence of opponents that react, although we need to 
impose certain restrictions on the behavior of the opponents.

Fictitious Play

We close this introductory section by discussing perhaps the most natural strategy for
playing repeated games: fictitious play. Player $k$ is said to use fictitious play if, at every
time instant $t$, he chooses an action that is a best response to the empirical distribution of
the opponents’ play up to time $t - 1$. In other words, the player chooses $I^{(k)}_t$ to minimize
the estimated loss

\[
\frac{1}{t-1} \sum_{s=1}^{t-1} \ell^{(k)}(\mathbf{I}^-_s, i_k)
\]

over $i_k \in \{1, \dots, N_k\}$,
where $(\mathbf{I}^-_s, i_k) = (I^{(1)}_s, \dots, i_k, \dots, I^{(K)}_s)$. The nontrivial behavior of this simple strategy is a
good demonstration of the complexity of the problem of describing regret-based strategies 
that lead to equilibrium. As we already mentioned in Section 4.3, fictitious play is not 
Hannan consistent. This fact may invite one to conjecture that there is no hope to achieve 
equilibrium via fictitious play. It may come as a surprise that this is often not the case. 
First, Robinson [246] proved that if in repeated playing of a two-person zero-sum game 
at each step both players use fictitious play, then the product distribution formed by the 
frequencies of actions played by both players converges to the set of Nash equilibria. This 
result was extended by Miyasawa [218] to general two-person games in which each player
has two actions (i.e., $N_k = 2$ for all $k = 1, 2$); see also [220] for more special cases in which
fictitious play leads to Nash equilibria in the same sense. However, Shapley [265] showed
that the result cannot be extended even for two-person non-zero-sum games. In fact, the 
empirical frequencies of play may not even converge to the set of correlated equilibria (see 
Exercise 7.2 for Shapley’s game). 
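
To make the definition concrete, here is a small sketch of fictitious play for both players of a two-person zero-sum game. It assumes a loss matrix given as a list of rows and breaks ties in favor of the lowest index; both the function name `fictitious_play` and the tie-breaking rule are choices made for this illustration, not prescriptions from the text.

```python
def fictitious_play(loss, n_rounds):
    """Both players best-respond to the opponent's empirical action counts.
    Returns the marginal empirical frequencies of the row and column players."""
    n_rows, n_cols = len(loss), len(loss[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    for _ in range(n_rounds):
        # Cumulative loss of each row action against the column player's past play.
        row_scores = [sum(col_counts[j] * loss[i][j] for j in range(n_cols))
                      for i in range(n_rows)]
        # The column player's loss is the negative of the row player's loss,
        # so his best response maximizes the row player's cumulative loss.
        col_scores = [sum(row_counts[i] * loss[i][j] for i in range(n_rows))
                      for j in range(n_cols)]
        i_t = row_scores.index(min(row_scores))
        j_t = col_scores.index(max(col_scores))
        row_counts[i_t] += 1
        col_counts[j_t] += 1
    return ([c / n_rounds for c in row_counts],
            [c / n_rounds for c in col_counts])


# Matching pennies: the only Nash equilibrium is uniform play by both players.
p_hat, q_hat = fictitious_play([[1.0, 0.0], [0.0, 1.0]], 10000)
```

On matching pennies the marginal frequencies returned by this sketch approach (1/2, 1/2) for both players, in line with Robinson's result, even though the play itself cycles through long runs of identical actions.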


7.2 Minimax Theorems 


As a first contact with game-theoretic applications of the prediction problems studied in 
earlier chapters, we derive a simple learning-style proof of a general minimax theorem 
that implies von Neumann’s minimax theorem cited in the introduction of this chapter. In 
particular, we prove the following. 


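Theorem 7.1. Let $f(x, y)$ denote a bounded real-valued function defined on $\mathcal{X} \times \mathcal{Y}$, where
$\mathcal{X}$ and $\mathcal{Y}$ are convex sets and $\mathcal{X}$ is compact. Suppose that $f(\cdot, y)$ is convex and continuous
for each fixed $y \in \mathcal{Y}$ and $f(x, \cdot)$ is concave for each fixed $x \in \mathcal{X}$. Then

\[
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} f(x, y).
\]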


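Proof. For any function $f$, one obviously has

\[
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) \ge \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} f(x, y).
\]

(To see this, just note that for all $x' \in \mathcal{X}$ and $y \in \mathcal{Y}$, $f(x', y) \ge \inf_x f(x, y)$, so that for all
$x' \in \mathcal{X}$, $\sup_y f(x', y) \ge \sup_y \inf_x f(x, y)$, which implies the statement.)
To prove the reverse inequality, without loss of generality, we may assume that $f(x, y) \in [0, 1]$
for each $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Fix a small $\varepsilon > 0$ and a large positive integer $n$. By the
compactness of $\mathcal{X}$ one may find a finite set of points $\{x^{(1)}, \dots, x^{(N)}\} \subset \mathcal{X}$ such that each
$x \in \mathcal{X}$ is within distance $\varepsilon$ of at least one of the $x^{(i)}$. We define the sequences $x_1, \dots, x_n \in \mathcal{X}$
and $y_1, \dots, y_n \in \mathcal{Y}$ recursively as follows. $y_0$ is chosen arbitrarily. For each $t = 1, \dots, n$
let

\[
x_t = \frac{\sum_{i=1}^{N} x^{(i)}\, e^{-\eta \sum_{s=0}^{t-1} f(x^{(i)}, y_s)}}{\sum_{j=1}^{N} e^{-\eta \sum_{s=0}^{t-1} f(x^{(j)}, y_s)}},
\]

where $\eta = \sqrt{8 \ln N / n}$ and $y_t$ is such that $f(x_t, y_t) \ge \sup_{y \in \mathcal{Y}} f(x_t, y) - 1/n$. Then, by the
convexity of $f$ in its first argument, we obtain (by Theorem 2.2) that

\[
\frac{1}{n} \sum_{t=1}^{n} f(x_t, y_t) \le \min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} f(x^{(i)}, y_t) + \sqrt{\frac{\ln N}{2n}}.
\tag{7.1}
\]
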

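Thus, we have

\[
\begin{aligned}
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y)
&\le \sup_{y \in \mathcal{Y}} f\left( \frac{1}{n} \sum_{t=1}^{n} x_t,\, y \right) \\
&\le \sup_{y \in \mathcal{Y}} \frac{1}{n} \sum_{t=1}^{n} f(x_t, y) && \text{(by convexity of } f(\cdot, y)\text{)} \\
&\le \frac{1}{n} \sum_{t=1}^{n} \sup_{y \in \mathcal{Y}} f(x_t, y) \\
&\le \frac{1}{n} \sum_{t=1}^{n} f(x_t, y_t) + \frac{1}{n} && \text{(by definition of } y_t\text{)} \\
&\le \min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} f(x^{(i)}, y_t) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n} && \text{(by (7.1))} \\
&\le \min_{i=1,\dots,N} f\left( x^{(i)},\, \frac{1}{n} \sum_{t=1}^{n} y_t \right) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n} && \text{(by concavity of } f(x, \cdot)\text{)} \\
&\le \sup_{y \in \mathcal{Y}} \min_{i=1,\dots,N} f(x^{(i)}, y) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n}.
\end{aligned}
\]
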


Thus, we have, for each $n$,

\[
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) \le \sup_{y \in \mathcal{Y}} \min_{i=1,\dots,N} f(x^{(i)}, y) + \sqrt{\frac{\ln N}{2n}} + \frac{1}{n},
\]

so that, letting $n \to \infty$, we get

\[
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) \le \sup_{y \in \mathcal{Y}} \min_{i=1,\dots,N} f(x^{(i)}, y).
\]

Letting $\varepsilon \to 0$ and using the continuity of $f$, we conclude that

\[
\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) \le \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} f(x, y),
\]

as desired. ∎

Remark 7.1 (von Neumann’s minimax theorem). Observe that Theorem 7.1 implies von
Neumann’s minimax theorem for two-person zero-sum games. To see this, just note that
the function $\bar{\ell}$ is bounded and linear in both of its arguments and the simplex of all mixed
strategies $\mathbf{p}$ (and similarly $\mathbf{q}$) is a compact set. In this special case the infima and suprema
are achieved.


7.3 Repeated Two-Player Zero-Sum Games 


We start our investigation of regret-based strategies for repeated game playing from the
simple case of two-person zero-sum games. Recall that in our model, at each round $t$,
based on the past plays of both players, the row player chooses an action $I_t \in \{1, \dots, N\}$
according to the mixed strategy $\mathbf{p}_t = (p_{1,t}, \dots, p_{N,t})$ and the column player chooses an
action $J_t \in \{1, \dots, M\}$ according to the mixed strategy $\mathbf{q}_t = (q_{1,t}, \dots, q_{M,t})$. The
distributions $\mathbf{p}_t$ and $\mathbf{q}_t$ may depend on the past plays of both. The row player’s loss at time $t$
is $\ell(I_t, J_t)$ and the column player’s loss is $-\ell(I_t, J_t)$. At each time instant, after making
the play, the row player observes the losses $\ell(i, J_t)$ he would have suffered had he played
strategy $i$, $i = 1, \dots, N$.

In view of studying the convergence to equilibrium in such games, we consider the
problem of minimizing the cumulative loss of the row player. If the row player knew the
column player’s actions $J_1, \dots, J_n$ in advance, he would, at each time instant, choose his
actions to satisfy $I_t = \operatorname{argmin}_{i=1,\dots,N} \ell(i, J_t)$, incurring a total loss $\sum_{t=1}^{n} \min_{i=1,\dots,N} \ell(i, J_t)$.
Achieving a cumulative loss close to this minimum without knowing the column player’s
actions is, except for trivial cases, impossible (see Exercise 7.6), and so the row player has
to put up with a less ambitious goal. A meaningful objective is to play almost as well as the
best constant strategy. Thus, we consider the problem of minimizing the difference between
the row player’s cumulative loss and the cumulative loss of the best constant strategy,
that is,

\[
\sum_{t=1}^{n} \ell(I_t, J_t) - \min_{i=1,\dots,N} \sum_{t=1}^{n} \ell(i, J_t).
\]


By a simple application of regret-minimizing forecasters we show that simple strategies 
indeed exist such that this difference grows sublinearly (almost surely) no matter how the 
column player plays. This result is a simple consequence of the Hannan consistency results 
of Chapter 4. 

It is natural to consider regret-minimizing strategies for both players. For example, we
may assume that the players play according to Hannan consistent forecasting strategies.
More precisely, assume that the row player chooses his actions $I_t$ such that, regardless of
what the column player does,

\[
\limsup_{n \to \infty} \left( \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) - \min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} \ell(i, J_t) \right) \le 0 \quad \text{almost surely.}
\]

Recall from Chapter 4 that several such Hannan-consistent procedures are available. For
example, this may be achieved by the exponentially weighted average mixed strategy

\[
p_{i,t} = \frac{\exp\left( -\eta \sum_{s=1}^{t-1} \ell(i, J_s) \right)}{\sum_{k=1}^{N} \exp\left( -\eta \sum_{s=1}^{t-1} \ell(k, J_s) \right)}, \qquad i = 1, \dots, N,
\]

where $\eta > 0$. For this particular forecaster we have, with probability at least $1 - \delta$,

\[
\sum_{t=1}^{n} \ell(I_t, J_t) - \min_{i=1,\dots,N} \sum_{t=1}^{n} \ell(i, J_t) \le \frac{\ln N}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}
\]

(see Corollary 4.2).
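
The exponentially weighted average strategy is short to implement. The sketch below is only an illustration under stated assumptions: it tracks the expected loss of the mixed strategy rather than the loss of a randomly drawn action, so it checks the deterministic part $\ln N/\eta + n\eta/8$ of the bound (the additional square-root term accounts for the randomization). The loss matrix, the column player's sequence, and the names `exp_weights` and `expected_regret` are hypothetical.

```python
import math


def exp_weights(cum_losses, eta):
    """Exponentially weighted average mixed strategy for given cumulative losses."""
    base = min(cum_losses)  # shifting all losses by a constant leaves the weights unchanged
    weights = [math.exp(-eta * (c - base)) for c in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]


def expected_regret(loss, col_actions, eta):
    """Cumulative expected loss of the forecaster minus that of the best fixed row."""
    n_rows = len(loss)
    cum_losses = [0.0] * n_rows
    forecaster_loss = 0.0
    for j in col_actions:
        p = exp_weights(cum_losses, eta)
        forecaster_loss += sum(p[i] * loss[i][j] for i in range(n_rows))
        # Full information: the row player sees the whole column of losses.
        for i in range(n_rows):
            cum_losses[i] += loss[i][j]
    return forecaster_loss - min(cum_losses)


# A rock-paper-scissors style game with losses in [0, 1] (illustrative).
loss = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
n = 300
col_actions = [(t * t) % 3 for t in range(n)]  # an arbitrary fixed sequence
eta = math.sqrt(8 * math.log(3) / n)
regret = expected_regret(loss, col_actions, eta)
bound = math.log(3) / eta + n * eta / 8
```

With this choice of $\eta$ the bound equals $\sqrt{(n/2) \ln N}$, and it holds for any sequence of column actions, not only the one used here.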


Remark 7.2 (Nonoblivious opponent). It is important to point out that in the definition of
regret, the cumulative loss $\sum_{t=1}^{n} \ell(i, J_t)$ associated with the “constant” action $i$ corresponds
to the sequence of plays $J_1, \dots, J_n$ of the opponent. The plays of the opponent may depend
on the forecaster’s actions, which, in this case, are $I_1, \dots, I_n$. Therefore, it is important to
keep in mind that if the opponent is nonoblivious (recall the definition from Chapter 4),
then $\sum_{t=1}^{n} \ell(i, J_t)$ is not the same as the cumulative loss the forecaster would have suffered
had he played action $I_t = i$ for all $t$. This issue is investigated in detail in Section 7.11.


Remark 7.3 (Time-varying games). The inequality above may be extended, in a
straightforward way, to the case of time-varying games, that is, when the loss matrix is allowed
to change with time as long as the entries $\ell_t(i, j)$ stay uniformly bounded, say, between 0
and 1. The loss matrix $\ell_t$ does not need to be known in advance by the row player. All we
need to assume is that before making the play at time $t$, the column $\ell_{t-1}(\cdot, J_{t-1})$ of the loss
matrix corresponding to the opponent’s play in the previous round is revealed to the row
player. In a more difficult version of the problem, only the suffered loss $\ell_{t-1}(I_{t-1}, J_{t-1})$ is
observed by the row player before time $t$. These problems may be handled by the techniques
of Section 6.7 (see also Section 7.5).


We now show the following remarkable fact: if the row player plays according to any 
Hannan consistent strategy, then his average loss cannot be much larger than the value of 
the game, regardless of the opponent’s strategy. (Note that by von Neumann’s minimax 
theorem this may also be achieved if the row player plays according to any minimax 
strategy; see Exercise 7.7.) 

Recall that the value of the game characterized by the loss matrix $\ell$ is defined by

\[
V = \max_{\mathbf{q}} \min_{\mathbf{p}} \bar{\ell}(\mathbf{p}, \mathbf{q}),
\]

where the maximum is taken over all probability vectors $\mathbf{q} = (q_1, \dots, q_M)$, the minimum
is taken over all probability vectors $\mathbf{p} = (p_1, \dots, p_N)$, and

\[
\bar{\ell}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_i q_j\, \ell(i, j).
\]

(Note that the maximum and the minimum are always achieved.) With some abuse of
notation we also write

\[
\bar{\ell}(\mathbf{p}, j) = \sum_{i=1}^{N} p_i\, \ell(i, j) \qquad \text{and} \qquad \bar{\ell}(i, \mathbf{q}) = \sum_{j=1}^{M} q_j\, \ell(i, j).
\]


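Theorem 7.2. Assume that in a two-person zero-sum game the row player plays according
to a Hannan-consistent strategy. Then

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le V \quad \text{almost surely.}
\]

Proof. By the assumption of Hannan consistency, it suffices to show that

\[
\min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} \ell(i, J_t) \le V.
\]

This may be seen easily as follows. First,

\[
\min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} \ell(i, J_t) = \min_{\mathbf{p}} \frac{1}{n} \sum_{t=1}^{n} \bar{\ell}(\mathbf{p}, J_t)
\]

since $\sum_{t=1}^{n} \bar{\ell}(\mathbf{p}, J_t)$ is linear in $\mathbf{p}$ and its minimum, over the simplex of probability vectors,
is achieved in one of the corners. Then, letting $\widehat{q}_{j,n} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{J_t = j\}}$ be the empirical
probability of the column player’s action being $j$,

\[
\begin{aligned}
\min_{\mathbf{p}} \frac{1}{n} \sum_{t=1}^{n} \bar{\ell}(\mathbf{p}, J_t)
&= \min_{\mathbf{p}} \sum_{j=1}^{M} \widehat{q}_{j,n}\, \bar{\ell}(\mathbf{p}, j) \\
&= \min_{\mathbf{p}} \bar{\ell}(\mathbf{p}, \widehat{\mathbf{q}}_n) \qquad \text{(where } \widehat{\mathbf{q}}_n = (\widehat{q}_{1,n}, \dots, \widehat{q}_{M,n})\text{)} \\
&\le \max_{\mathbf{q}} \min_{\mathbf{p}} \bar{\ell}(\mathbf{p}, \mathbf{q}) = V. \qquad ∎
\end{aligned}
\]
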
Theorem 7.2 shows that, regardless of what the opponent plays, if the row player plays
according to a Hannan-consistent strategy, then his average loss is guaranteed to be
asymptotically not more than the value $V$ of the game. It follows by symmetry that if both
players use Hannan-consistent strategies, then the average loss of the row player converges to $V$.


Corollary 7.1. Assume that in a two-person zero-sum game, both players play according
to some Hannan consistent strategy. Then

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) = V \quad \text{almost surely.}
\]

Proof. By Theorem 7.2

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le V \quad \text{almost surely.}
\]

The same theorem, applied to the column player, implies, using the fact that the column
player’s loss is the negative of the row player’s loss, that

\[
\liminf_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \ge \min_{\mathbf{p}} \max_{\mathbf{q}} \bar{\ell}(\mathbf{p}, \mathbf{q}) \quad \text{almost surely.}
\]

By von Neumann’s minimax theorem, the latter quantity equals $V$. ∎
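
The convergence to the value of the game can be observed in a small experiment. The sketch below is an assumption-laden illustration rather than the setting of the corollary itself: both players use exponentially weighted averages with a fixed horizon-dependent $\eta$, and the players update on expected losses of their mixed strategies, so that the run is deterministic and the average loss must lie within $\sqrt{\ln N/(2n)}$ above and $\sqrt{\ln M/(2n)}$ below $V$. The 2 x 2 loss matrix is hypothetical and has value 1/2.

```python
import math


def weights_from(cum_losses, eta):
    """Exponentially weighted average distribution for the given cumulative losses."""
    base = min(cum_losses)
    w = [math.exp(-eta * (c - base)) for c in cum_losses]
    total = sum(w)
    return [x / total for x in w]


def self_play_average_loss(loss, n_rounds):
    """Average expected loss of the row player when both players use exponential weights."""
    n_rows, n_cols = len(loss), len(loss[0])
    eta_row = math.sqrt(8 * math.log(n_rows) / n_rounds)
    eta_col = math.sqrt(8 * math.log(n_cols) / n_rounds)
    row_cum = [0.0] * n_rows  # cumulative expected loss of each row action
    col_cum = [0.0] * n_cols  # cumulative expected loss (negated row loss) of each column action
    total = 0.0
    for _ in range(n_rounds):
        p = weights_from(row_cum, eta_row)
        q = weights_from(col_cum, eta_col)
        total += sum(p[i] * q[j] * loss[i][j]
                     for i in range(n_rows) for j in range(n_cols))
        for i in range(n_rows):
            row_cum[i] += sum(q[j] * loss[i][j] for j in range(n_cols))
        for j in range(n_cols):
            col_cum[j] -= sum(p[i] * loss[i][j] for i in range(n_rows))
    return total / n_rounds


# Value 1/2: the row player mixes (1/3, 2/3), the column player mixes (1/2, 1/2).
game = [[0.0, 1.0], [0.75, 0.25]]
n_rounds = 2000
average = self_play_average_loss(game, n_rounds)
tolerance = math.sqrt(math.log(2) / (2 * n_rounds))
```

The two-sided guarantee follows by applying the regret bound once to each player, exactly as in the proof above.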


Remark 7.4 (Convergence to equilibria). If both players follow some Hannan consistent
strategy, then it is also easy to see that the product distribution $\widehat{\mathbf{p}}_n \times \widehat{\mathbf{q}}_n$ formed by the
(marginal) empirical distributions of play

\[
\widehat{p}_{i,n} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{I_t = i\}} \qquad \text{and} \qquad \widehat{q}_{j,n} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{J_t = j\}}
\]

of the two players converges, almost surely, to the set of Nash equilibria $\pi = \mathbf{p} \times \mathbf{q}$ of the
game (see Exercise 7.11). However, it is important to note that this does not mean that
the players’ joint play is close to a Nash equilibrium in the long run. Indeed, one cannot
conclude that the joint empirical frequencies of play

\[
\widehat{P}_n(i, j) = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{I_t = i,\, J_t = j\}}
\]

converge to the set of Nash equilibria. All one can say is that $\widehat{P}_n$ converges to the Hannan set
of the game (defined later), which, even for zero-sum games, may include joint distributions
that are not Nash equilibria (see Exercise 7.15).


7.4 Correlated Equilibrium and Internal Regret 


In this section we consider repeated play of general K-person games. In the previous section
we have shown the following fact: if both players of a two-person zero-sum game follow a
Hannan-consistent strategy (i.e., play so that their external regret vanishes asymptotically),
then in the long run equilibrium is achieved in the sense that the product of the marginal
empirical frequencies of play converges to the set of Nash equilibria. It is natural to ask
whether the joint empirical frequencies of play converge in any general K-person game.
The answer is easily seen to be negative in general by the following argument. Assume that
in a repeated K-person game each player follows a Hannan consistent strategy; that is, if the
K-tuple of plays of all players at time $t$ is $\mathbf{I}_t = (I^{(1)}_t, \dots, I^{(K)}_t)$, then for all $k = 1, \dots, K$
the cumulative loss of player $k$ satisfies

\[
\limsup_{n \to \infty} \left( \frac{1}{n} \sum_{t=1}^{n} \ell^{(k)}(\mathbf{I}_t) - \min_{i_k = 1, \dots, N_k} \frac{1}{n} \sum_{t=1}^{n} \ell^{(k)}(\mathbf{I}^-_t, i_k) \right) \le 0
\]

almost surely. Writing

\[
\widehat{P}_n(\mathbf{i}) = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{\mathbf{I}_t = \mathbf{i}\}}, \qquad \mathbf{i} = (i_1, \dots, i_K) \in \bigotimes_{k=1}^{K} \{1, \dots, N_k\}
\]



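for the empirical joint distribution of play, the property of Hannan consistency may be
rewritten as follows: for all $k = 1, \dots, K$ and for all $j = 1, \dots, N_k$,

\[
\limsup_{n \to \infty} \left( \sum_{\mathbf{i}} \widehat{P}_n(\mathbf{i})\, \ell^{(k)}(\mathbf{i}) - \sum_{\mathbf{i}} \widehat{P}_n(\mathbf{i})\, \ell^{(k)}(\mathbf{i}^-, j) \right) \le 0
\]

almost surely, where $(\mathbf{i}^-, j) = (i_1, \dots, j, \dots, i_K)$ denotes the K-tuple obtained when the
$k$th component $i_k$ of $\mathbf{i}$ is replaced by $j$. Writing the condition of Hannan consistency in
this form reveals that if all players follow such a strategy, the empirical frequencies of
play converge, almost surely, to the set $\mathcal{H}$ of joint distributions $P$ over $\bigotimes_{k=1}^{K} \{1, \dots, N_k\}$
defined by

\[
\mathcal{H} = \left\{ P \,:\, \forall k = 1, \dots, K,\ \forall j = 1, \dots, N_k,\ \sum_{\mathbf{i}} P(\mathbf{i})\, \ell^{(k)}(\mathbf{i}) \le \sum_{\mathbf{i}} P(\mathbf{i})\, \ell^{(k)}(\mathbf{i}^-, j) \right\}
\]

(see Exercise 7.14). The set $\mathcal{H}$ is called the Hannan set of the game. Since $\mathcal{H}$ is an
intersection of $\sum_{k=1}^{K} N_k$ halfspaces, it is a closed and convex subset of the simplex of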
all joint distributions. In other words, if all players play according to any strategy that 
asymptotically minimizes their external regret, the empirical frequencies of play converge 
to the set H. Unfortunately, distributions in the Hannan set do not correspond to any natural 
equilibrium concept. In fact, by comparing the definition of H with the characterization of 
correlated equilibria given in Lemma 7.1, it is easy to see that H always contains the set C 
of correlated equilibria. Even though for some special games (such as games in which all 
players have two actions; see Exercise 7.18) H = C, in typical cases C is a proper subset of 
H (see Exercise 7.16). Thus, except for special cases, if players are merely required to play 
according to Hannan-consistent strategies (i.e., to minimize their external regret), there is 
no hope to achieve a correlated equilibrium, let alone a Nash equilibrium. 
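
The gap between the Hannan set and the set of correlated equilibria can be checked numerically on a small game. The following Python sketch tests both families of inequalities for a joint distribution given as a dictionary from action profiles to probabilities. The function names, the data layout, and the rock-paper-scissors example are illustrative assumptions and do not come from the text; the correlated-equilibrium test uses the conditional inequalities that also define an $\varepsilon$-correlated equilibrium.

```python
def replace_action(profile, k, j):
    """Return the profile with player k's action replaced by j, i.e. (i^-, j)."""
    return profile[:k] + (j,) + profile[k + 1:]


def in_hannan_set(P, losses, n_actions, tol=1e-9):
    """P maps profiles (tuples) to probabilities; losses[k](profile) is player k's loss."""
    for k, loss in enumerate(losses):
        played = sum(p * loss(i) for i, p in P.items())
        for j in range(n_actions[k]):
            constant_j = sum(p * loss(replace_action(i, k, j)) for i, p in P.items())
            if played > constant_j + tol:
                return False
    return True


def is_correlated_equilibrium(P, losses, n_actions, eps=0.0, tol=1e-9):
    """Check the conditional inequalities for every player k and every pair (j, j')."""
    for k, loss in enumerate(losses):
        for j in range(n_actions[k]):
            for j_new in range(n_actions[k]):
                gain = sum(p * (loss(i) - loss(replace_action(i, k, j_new)))
                           for i, p in P.items() if i[k] == j)
                if gain > eps + tol:
                    return False
    return True


# Hypothetical example: rock-paper-scissors with losses in [0, 1]
# (0 for a win, 0.5 for a tie, 1 for a defeat).
def row_loss(profile):
    a, b = profile
    if a == b:
        return 0.5
    return 0.0 if (a - b) % 3 == 1 else 1.0


def col_loss(profile):
    return 1.0 - row_loss(profile)


losses = [row_loss, col_loss]
n_actions = [3, 3]
diagonal = {(a, a): 1 / 3 for a in range(3)}                    # both always play the same action
uniform = {(a, b): 1 / 9 for a in range(3) for b in range(3)}   # independent uniform play
```

In this example the "diagonal" distribution satisfies every Hannan inequality (no constant action improves on it) but fails the conditional test, since a player told to play rock knows the opponent plays rock as well and would rather switch to paper.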

However, by requiring just a little bit more, convergence to correlated equilibria may be 
achieved. The main result of this section is that if each player plays according to a strategy 
minimizing internal regret as described in Section 4.4, the joint empirical frequencies of 
play converge to a correlated equilibrium. 

More precisely, consider the model of playing repeated games described in Section 7.1. 
For each player $k$ and pair of actions $j, j' \in \{1, \ldots, N_k\}$ define the conditional instantaneous 
regret at time $t$ by 

\[
\hat{r}^{(k)}_{(j,j'),t} = \mathbb{I}_{\{I_t^{(k)} = j\}} \left( \ell^{(k)}(\mathbf{I}_t) - \ell^{(k)}(\mathbf{I}_t^-, j') \right),
\]

where $(\mathbf{I}_t^-, j') = (I_t^{(1)}, \ldots, I_t^{(k-1)}, j', I_t^{(k+1)}, \ldots, I_t^{(K)})$ is obtained by replacing the play of 
player $k$ by action $j'$. Note that the expected value of the conditional instantaneous regret 
$\hat{r}^{(k)}_{(j,j'),t}$, calculated with respect to the distribution $\mathbf{p}_t^{(k)}$ of $I_t^{(k)}$, is just the instantaneous internal 
regret of player $k$ defined in Section 4.4. The conditional regret $\sum_{t=1}^{n} \hat{r}^{(k)}_{(j,j'),t}$ expresses how 
much better player $k$ could have done had he chosen action $j'$ every time he played 
action $j$. 

The next lemma shows that if each player plays such that his conditional regret remains 
small, then the empirical distribution of plays will be close to a correlated equilibrium. 


Lemma 7.2. Consider a K-person game and denote its set of correlated equilibria by C. 
Assume that the game is played repeatedly so that for each player $k = 1, \ldots, K$ and pair 
of actions $j, j' \in \{1, \ldots, N_k\}$ the conditional regrets satisfy 

\[
\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \hat{r}^{(k)}_{(j,j'),t} \le 0.
\]

Then the distance $\inf_{P \in C} \sum_{\mathbf{i}} |P(\mathbf{i}) - \hat{P}_n(\mathbf{i})|$ between the empirical distribution of plays and 
the set of correlated equilibria converges to 0. 


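Proof. Observe that the assumption on the conditional regrets may be rewritten as 

\[
\limsup_{n\to\infty} \sum_{\mathbf{i}\,:\,i_k = j} \hat{P}_n(\mathbf{i}) \left( \ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i}^-, j') \right) \le 0,
\]

where $(\mathbf{i}^-, j')$ is the same as $\mathbf{i}$ except that its $k$th component is replaced by $j'$. Assume 
that the sequence $\hat{P}_n$ does not converge to C. Then by the compactness of the set of all 
probability distributions there is a distribution $P^* \notin C$ and a subsequence $\hat{P}_{n_k}$ of empirical 
distributions such that $\lim_{k\to\infty} \hat{P}_{n_k} = P^*$. But since $P^*$ is not a correlated equilibrium, by 
Lemma 7.1 there exists a player $k \in \{1, \ldots, K\}$ and a pair of actions $j, j' \in \{1, \ldots, N_k\}$ 
such that 

\[
\sum_{\mathbf{i}\,:\,i_k = j} P^*(\mathbf{i}) \left( \ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i}^-, j') \right) > 0,
\]

which contradicts the assumption, because the sum on the left-hand side depends continuously on the distribution. ∎ 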
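
The rewriting used at the start of the proof is an exact identity for every finite history of play: the average conditional regret equals the corresponding sum against the empirical joint distribution. The Python sketch below checks this on a short hypothetical history; the function names, the loss table, and the history are assumptions made for illustration only.

```python
from collections import Counter


def replace_action(profile, k, j):
    """Return the profile with player k's action replaced by j, i.e. (i^-, j)."""
    return profile[:k] + (j,) + profile[k + 1:]


def empirical_distribution(history):
    """Empirical joint distribution of a list of action profiles."""
    n = len(history)
    return {profile: count / n for profile, count in Counter(history).items()}


def average_conditional_regret(history, loss, k, j, j_new):
    """(1/n) * sum over rounds where player k played j of loss(I_t) - loss(I_t^-, j')."""
    total = 0.0
    for profile in history:
        if profile[k] == j:
            total += loss(profile) - loss(replace_action(profile, k, j_new))
    return total / len(history)


def regret_from_distribution(P, loss, k, j, j_new):
    """The same quantity written as a sum against a joint distribution P."""
    return sum(p * (loss(i) - loss(replace_action(i, k, j_new)))
               for i, p in P.items() if i[k] == j)


# Hypothetical two-player history and loss table for player 0.
table = [[0.2, 0.9], [0.6, 0.1]]
loss0 = lambda profile: table[profile[0]][profile[1]]
history = [(0, 1), (1, 1), (0, 0), (0, 1), (1, 0)]

direct = average_conditional_regret(history, loss0, 0, 0, 1)
via_distribution = regret_from_distribution(empirical_distribution(history), loss0, 0, 0, 1)
```

Both quantities agree up to floating-point rounding, which is why a condition on the time averages of the conditional regrets can be read as a condition on $\hat{P}_n$.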


Now it is easy to show that if all players play according to an internal-regret-minimizing 
strategy, such as that described in Section 4.4, a correlated equilibrium is reached asymptotically. 
More precisely, for each player $k = 1, \ldots, K$ define the components of the internal 
instantaneous regret vector by 

\[
r^{(k)}_{(j,j'),t} = p^{(k)}_{j,t} \left( \ell^{(k)}(\mathbf{I}_t^-, j) - \ell^{(k)}(\mathbf{I}_t^-, j') \right),
\]

where $j, j' \in \{1, \ldots, N_k\}$ and $(\mathbf{I}_t^-, j')$ is as defined above. In Section 4.4 we saw that 
player $k$ has a simple strategy $\mathbf{p}_t^{(k)}$ such that, regardless of the other players’ actions, it is 
guaranteed that the internal regret satisfies 

\[
\max_{j,j'} \frac{1}{n}\sum_{t=1}^{n} r^{(k)}_{(j,j'),t} \le c\,\sqrt{\frac{\ln N_k}{n}}
\]

for a universal constant $c$. Now, clearly, $r^{(k)}_{(j,j'),t}$ is the conditional expectation of $\hat{r}^{(k)}_{(j,j'),t}$ 
given the past and the other players’ actions. Thus $\hat{r}^{(k)}_{(j,j'),t} - r^{(k)}_{(j,j'),t}$ is a bounded martingale 
difference sequence for any fixed $j, j'$. Therefore, by the Hoeffding–Azuma inequality 
(Lemma A.7) and the Borel–Cantelli lemma, we have, for each $k \in \{1, \ldots, K\}$ and $j, j' \in \{1, \ldots, N_k\}$, 

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \left( \hat{r}^{(k)}_{(j,j'),t} - r^{(k)}_{(j,j'),t} \right) = 0 \quad \text{almost surely},
\]

which implies that for each $k$ 

\[
\limsup_{n\to\infty} \max_{j,j'} \frac{1}{n}\sum_{t=1}^{n} \hat{r}^{(k)}_{(j,j'),t} \le 0 \quad \text{almost surely}.
\]


The following theorem summarizes what we have just proved. 


Theorem 7.3. Consider a K-person game and denote its set of correlated equilibria by C. 
Assume that a game is played repeatedly such that each player $k = 1, \ldots, K$ plays according 
to an internal-regret-minimizing strategy (such as the ones described in Section 4.4). 
Then the distance $\inf_{P \in C} \sum_{\mathbf{i}} |P(\mathbf{i}) - \hat{P}_n(\mathbf{i})|$ between the empirical distribution of plays and 
the set of correlated equilibria converges to 0 almost surely. 


Remark 7.5 (Existence of correlated equilibria). Observe that the theorem implicitly 
entails the existence of a correlated equilibrium of any game. Of course, this fact follows 
from the existence of Nash equilibria. However, unlike the proof of existence of Nash 
equilibria, the above argument avoids the use of fixed-point theorems. 


Remark 7.6 (Computation of correlated equilibria). Given a K-person game, it is a 
complex and important problem to exhibit a computationally efficient procedure that finds 
a Nash equilibrium (or even better, the set of all Nash equilibria) of the game. As of today, 
no polynomial-time algorithm is known to approximate a Nash equilibrium. (The algorithm 
is required to be polynomial in the number of players and the number of actions of each 
player.) The difficulty of the problem may be understood by noting that even if every player 
in a K-person game has just two actions to choose from, there are $2^K$ possible action 
profiles $\mathbf{i} = (i_1, \ldots, i_K)$, and so describing the payoff functions already takes exponential 
time. Interesting polynomial-time algorithms for computing Nash equilibria are available 
for important special classes of games that have a compact representation, such as symmetric 
games, graphical games, and so forth. We refer the interested reader to Papadimitriou [230, 
231], Kearns and Mansour [179], Papadimitriou and Roughgarden [232]. Here we point 
out that the regret-based procedures described in this section may also be used to efficiently 
approximate correlated equilibria in a natural computational model. For any $\varepsilon > 0$, a 
probability distribution $P$ over the set of all $K$-tuples $\mathbf{i} = (i_1, \ldots, i_K)$ of actions is called an 
$\varepsilon$-correlated equilibrium if, for every player $k \in \{1, \ldots, K\}$ and actions $j, j' \in \{1, \ldots, N_k\}$, 

\[
\sum_{\mathbf{i}\,:\,i_k = j} P(\mathbf{i}) \left( \ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i}^-, j') \right) \le \varepsilon.
\]

Clearly, $\varepsilon$-correlated equilibria approximate correlated equilibria in the sense that if $C_\varepsilon$ 
denotes the set of all $\varepsilon$-correlated equilibria, then $\bigcap_{\varepsilon > 0} C_\varepsilon = C$. Assume that an oracle 
is available that outputs the values of the loss functions $\ell^{(k)}(\mathbf{i})$ for any action profile 
$\mathbf{i} = (i_1, \ldots, i_K)$. Then it is easy to define an algorithm that calls the oracle polynomially 
many times and outputs an $\varepsilon$-correlated equilibrium. One may simply simulate as if all 
players played an internal-regret-minimizing procedure and calculate the joint distribution 
$\hat{P}_n$ of plays. By the results of Section 4.4, the Hoeffding–Azuma inequality and the union 
bound, with probability at least $1 - \delta$, for each $k = 1, \ldots, K$, 

\[
\max_{j,j'} \sum_{\mathbf{i}\,:\,i_k = j} \hat{P}_n(\mathbf{i}) \left( \ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i}^-, j') \right) \le 2\sqrt{\frac{\ln N_k}{n}} + \sqrt{\frac{\ln (N_k K/\delta)}{2n}}.
\]

Thus, with probability at least $1 - \delta$, $\hat{P}_n$ is an $\varepsilon$-correlated equilibrium if 

\[
n \ge \max_{k=1,\ldots,K} \frac{16}{\varepsilon^2} \ln \frac{N_k K}{\delta}.
\]

To compute all regrets necessary to run this algorithm, the oracle needs to be called not 
more than 

\[
\max_{k=1,\ldots,K} \frac{16 N_k K}{\varepsilon^2} \ln \frac{N_k K}{\delta}
\]

times to find an $\varepsilon$-correlated equilibrium, with probability at least $1 - \delta$. Thus, computation 
of an $\varepsilon$-correlated equilibrium is surprisingly fast: it takes time proportional to $1/\varepsilon^2$ and is 
polynomial (in fact, barely superlinear) in terms of the number of actions and the number 
of players of the game. 
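
To get a feeling for the orders of magnitude in this remark, the bound on the number of oracle calls can be evaluated directly. The Python sketch below does so; the function names are hypothetical, and the number of simulated rounds is computed under the assumption that the sufficient condition is $n \ge \max_k (16/\varepsilon^2) \ln(N_k K/\delta)$, which is the reading consistent with the oracle-call bound stated above.

```python
import math


def rounds_needed(n_actions, eps, delta):
    """Assumed sufficient number of simulated rounds: max_k (16 / eps^2) * ln(N_k * K / delta)."""
    K = len(n_actions)
    return math.ceil(max(16.0 / eps ** 2 * math.log(N * K / delta) for N in n_actions))


def oracle_call_bound(n_actions, eps, delta):
    """Bound on oracle calls: max_k (16 * N_k * K / eps^2) * ln(N_k * K / delta)."""
    K = len(n_actions)
    return max(16.0 * N * K / eps ** 2 * math.log(N * K / delta) for N in n_actions)


# Hypothetical game: two players with two actions each.
calls = oracle_call_bound([2, 2], eps=0.1, delta=0.05)
rounds = rounds_needed([2, 2], eps=0.1, delta=0.05)
```

Halving $\varepsilon$ multiplies both quantities by four, while the dependence on the confidence parameter $\delta$ is only logarithmic.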


7.5 Unknown Games: Game-Theoretic Bandits 


In this section we mention the possibility of learning correlated equilibria in an even more 
restricted model than the uncoupled model studied so far in this chapter. We assume that 
the K players of a game play repeatedly, and at each time instance $t$, after taking an action 
$I_t^{(k)}$, the $k$th player observes his loss $\ell^{(k)}(\mathbf{I}_t)$ (where $\mathbf{I}_t = (I_t^{(1)}, \ldots, I_t^{(K)})$ is the joint action 
profile played by the K players) but does not know his entire loss function $\ell^{(k)}$ and cannot 
observe the other players’ actions $\mathbf{I}_t^-$. In fact, player k may not even know the number of 
players participating in the game, let alone the number of actions the other players can 
choose from. 

It may come as a surprise that, with such limited information, the players have a way of 
playing that guarantees that the game reaches an equilibrium in some sense. Here we point 
out that there is a strategy such that if all players play according to it, then the joint empirical 
frequencies of play converge to a correlated equilibrium. The issue of convergence to a 
Nash equilibrium is addressed in Section 7.10. 

The fact that convergence of the empirical frequencies of play to a correlated equilibrium 
may be achieved follows directly from Theorem 7.3 and the fact that each player can 
guarantee a small internal regret even if he cannot observe the actions of the other players. 
Indeed, in order to achieve the desired convergence, each player’s job now is to minimize 
his internal regret in the bandit problem described in Chapter 6. Exercise 6.15 shows that 
such internal-regret minimizing procedures exist and therefore, if all players follow such a 
strategy, the empirical frequencies of play will converge to the set of correlated equilibria 
of the game. 


7.6 Calibration and Correlated Equilibrium 


In Section 4.5 we described the connection between calibrated and internal-regret- 
minimizing forecasting strategies. According to the results of Section 7.4, internal-regret- 
minimizing forecasters may be used, in a straightforward way, to achieve correlated equi- 
librium in a certain sense. In the present section we close the circle and point out an 
interesting connection between calibration and correlated equilibria; that is, we show that 
if each player bases his decision on a calibrated forecaster in an appropriate way (where 
the calibrated forecasters used by the different players may be completely different), then 
the joint empirical frequencies of play converge to the set of correlated equilibria. 

The strategy we assume the players follow is quite natural: on the basis of past experience, 
each player predicts the mixed strategy his opponents will use in the next round and selects 
an action that would minimize his loss if the opponents indeed played according to the 
predicted distribution. More precisely, each player forecasts the joint probability of each 
possible outcome of the next action of his opponents and then chooses a “best response” 
to the forecasted distribution. We saw in Section 4.5 that no matter how the opponents 
play, each player can construct a (randomized) well-calibrated forecast of the opponents’ 
sequence of play. The main result of this section shows that the mild requirement that all 
players use well calibrated forecasters guarantees that the joint empirical frequencies of 
play converge to the set C of correlated equilibria. 

To lighten the exposition, we present the results for K = 2 players, but the results trivially 
extend to the general case (left as exercise). So assume that the actions $I_t$ and $J_t$ selected 
by the two players at time $t$ are determined as follows. Depending on the past sequence of 
plays, for each $j = 1, \ldots, M$ the row player determines a probability forecast $\hat{q}_{j,t} \in [0, 1]$ 
of the next play $J_t$ of the column player where we require that $\sum_{j=1}^{M} \hat{q}_{j,t} = 1$. Denote the 
forecasted mixed strategy by $\hat{\mathbf{q}}_t = (\hat{q}_{1,t}, \ldots, \hat{q}_{M,t})$ and write $\mathbf{J}_t = (\mathbb{I}_{\{J_t = 1\}}, \ldots, \mathbb{I}_{\{J_t = M\}})$. 
We only require that the forecast be well calibrated. Recall from Section 4.5 that this means 
that the function 

\[
\rho_n(A) = \frac{\sum_{t=1}^{n} \mathbf{J}_t\, \mathbb{I}_{\{\hat{\mathbf{q}}_t \in A\}}}{\sum_{t=1}^{n} \mathbb{I}_{\{\hat{\mathbf{q}}_t \in A\}}}
\]

defined for all subsets A of the probability simplex (defined to be 0 if $\sum_{t=1}^{n} \mathbb{I}_{\{\hat{\mathbf{q}}_t \in A\}} = 0$) 
satisfies, for all A and for all $\varepsilon > 0$, 

\[
\limsup_{n\to\infty} \left\| \rho_n(A_\varepsilon) - \frac{\int_{A_\varepsilon} \mathbf{x}\, d\mathbf{x}}{\lambda(A_\varepsilon)} \right\| < \varepsilon
\]

(where $A_\varepsilon = \{\mathbf{x} : \exists\, \mathbf{y} \in A \text{ such that } \|\mathbf{x} - \mathbf{y}\| < \varepsilon\}$ is the $\varepsilon$-blowup of A and $\lambda$ stands for the 
uniform probability measure over the simplex). Recall from Section 4.5 (more precisely, 
Exercise 4.17) that regardless of what the sequence $J_1, J_2, \ldots$ is, the row player can determine 
a randomized, well-calibrated forecaster. We assume that the row player best responds 
to the forecasted probability distribution in the sense that 


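\[
I_t = \operatorname*{argmin}_{i=1,\ldots,N} \ell^{(1)}(i, \hat{\mathbf{q}}_t) = \operatorname*{argmin}_{i=1,\ldots,N} \sum_{j=1}^{M} \hat{q}_{j,t}\, \ell^{(1)}(i, j).
\]

Ties may be broken by an arbitrary but constant rule. The tie-breaking rule is described by 
the sets 

\[
\hat{B}_i = \{ \mathbf{q} : \text{row player plays action } i \text{ if } \hat{\mathbf{q}}_t = \mathbf{q} \}, \qquad i = 1, \ldots, N.
\]

Note that for each $i$, $\hat{B}_i$ is contained in the closed and convex set $B_i$ of $\mathbf{q}$’s to which $i$ is a 
best reply defined by 

\[
B_i = \Bigl\{ \mathbf{q} : \ell^{(1)}(i, \mathbf{q}) = \min_{i'=1,\ldots,N} \ell^{(1)}(i', \mathbf{q}) \Bigr\}.
\]

Assume also that the column player proceeds similarly, that is, 

\[
J_t = \operatorname*{argmin}_{j=1,\ldots,M} \ell^{(2)}(\hat{\mathbf{p}}_t, j) = \operatorname*{argmin}_{j=1,\ldots,M} \sum_{i=1}^{N} \hat{p}_{i,t}\, \ell^{(2)}(i, j),
\]

where for each $i$, the sequence $\hat{p}_{i,t}$ ($t = 1, 2, \ldots$) is a well-calibrated forecaster of $I_1, I_2, \ldots$. 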
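
The best-response step with a constant tie-breaking rule is simple to write down. The Python sketch below is one possible version for the row player; the function names and the nested-list layout of the loss matrix are assumptions, and "smallest index wins" is only one admissible constant rule.

```python
def expected_loss(loss_row, i, q):
    """Expected loss of row action i against the forecast q of the column player's play."""
    return sum(q_j * loss_row[i][j] for j, q_j in enumerate(q))


def best_response(loss_row, q):
    """Row action minimizing the expected loss; ties go to the smallest index."""
    values = [expected_loss(loss_row, i, q) for i in range(len(loss_row))]
    # min() returns the first minimizer, which makes the tie-breaking rule constant.
    return min(range(len(values)), key=lambda i: values[i])


# Hypothetical 2 x 2 loss matrix of the row player: loss_row[i][j].
loss_row = [[0.0, 1.0],
            [1.0, 0.0]]
```

With this rule the set $\hat{B}_i$ consists of all forecasts for which `best_response` returns $i$. In floating-point arithmetic, forecasts that are exact ties mathematically may be separated by rounding, so the rule is constant only up to that caveat.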


Theorem 7.4. Assume that in a two-person game both players play by best responding to a 
calibrated forecast of the opponent’s sequence of plays, as described above. Then the joint 
empirical frequencies of play 


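\[
\hat{P}_n(i, j) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{I_t = i,\ J_t = j\}}
\]

converge to the set C of correlated equilibria in the sense that 

\[
\inf_{P \in C} \sum_{(i,j)} |P(i, j) - \hat{P}_n(i, j)| \to 0 \quad \text{almost surely.}
\]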


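Proof. Consider the sequence of empirical distributions $\hat{P}_n$. Since the simplex of all 
joint distributions over $\{1, \ldots, N\} \times \{1, \ldots, M\}$ is a compact set, every sequence has 
a convergent subsequence. Thus, it suffices to show that the limit of every convergent 
subsequence of $\hat{P}_n$ is in C. Let $\hat{P}_{n_k}$ be such a convergent subsequence and denote its limit 
by $P$. We need to show that $P$ is a correlated equilibrium. 

Note that by Lemma 7.1, $P$ is a correlated equilibrium if and only if for each 
$i \in \{1, \ldots, N\}$ the conditional distribution 

\[
\mathbf{q}(\cdot \mid i) = \bigl( q(1 \mid i), \ldots, q(M \mid i) \bigr) = \left( \frac{P(i, 1)}{\sum_{j'=1}^{M} P(i, j')}, \ldots, \frac{P(i, M)}{\sum_{j'=1}^{M} P(i, j')} \right)
\]

is in the set $B_i$ and the symmetric statement for the other player holds as well. Therefore, 
it suffices to show that, for each $i = 1, \ldots, N$, the distribution 

\[
\hat{\mathbf{q}}_{n_k}(\cdot \mid i) = \left( \frac{\hat{P}_{n_k}(i, 1)}{\sum_{j'=1}^{M} \hat{P}_{n_k}(i, j')}, \ldots, \frac{\hat{P}_{n_k}(i, M)}{\sum_{j'=1}^{M} \hat{P}_{n_k}(i, j')} \right)
\]

approaches the set $B_i$. By the definition of $I_t$, for each $j = 1, \ldots, M$ 

\[
\hat{P}_{n_k}(i, j) = \frac{1}{n_k}\sum_{t=1}^{n_k} \mathbb{I}_{\{\hat{\mathbf{q}}_t \in \hat{B}_i\}}\, \mathbb{I}_{\{J_t = j\}}
\]

and therefore 

\[
\hat{\mathbf{q}}_{n_k}(\cdot \mid i) = \frac{\sum_{t=1}^{n_k} \mathbf{J}_t\, \mathbb{I}_{\{\hat{\mathbf{q}}_t \in \hat{B}_i\}}}{\sum_{t=1}^{n_k} \mathbb{I}_{\{\hat{\mathbf{q}}_t \in \hat{B}_i\}}} = \rho_{n_k}(\hat{B}_i).
\]

Since $\hat{P}_{n_k}$ is convergent, the limit 

\[
\mathbf{x} = \lim_{k\to\infty} \rho_{n_k}(\hat{B}_i)
\]

exists. We need to show that $\mathbf{x} \in B_i$. If $(\hat{B}_i)_\varepsilon$ denotes the $\varepsilon$-blowup of $\hat{B}_i$ then well-calibration 
of the forecaster $\hat{\mathbf{q}}_t$ implies that, for all $\varepsilon > 0$, 

\[
\limsup_{k\to\infty} \left\| \rho_{n_k}\bigl((\hat{B}_i)_\varepsilon\bigr) - \frac{\int_{(\hat{B}_i)_\varepsilon} \mathbf{x}\, d\mathbf{x}}{\lambda\bigl((\hat{B}_i)_\varepsilon\bigr)} \right\| < \varepsilon.
\]

Define 

\[
\mathbf{x}' = \lim_{\varepsilon \to 0} \frac{\int_{(\hat{B}_i)_\varepsilon} \mathbf{x}\, d\mathbf{x}}{\lambda\bigl((\hat{B}_i)_\varepsilon\bigr)}.
\]

(The limit exists by the continuity of measure.) Because for all $n_k$ we have 
$\lim_{\varepsilon \to 0} \rho_{n_k}\bigl((\hat{B}_i)_\varepsilon\bigr) = \rho_{n_k}(\hat{B}_i)$, it is easy to see that $\mathbf{x} = \mathbf{x}'$. But since $\hat{B}_i \subset B_i$ and $B_i$ is 
convex, the vector $\int_{(\hat{B}_i)_\varepsilon} \mathbf{x}\, d\mathbf{x} / \lambda\bigl((\hat{B}_i)_\varepsilon\bigr)$ lies in the $\varepsilon$-blowup of $B_i$. Therefore we indeed have 
$\mathbf{x} = \mathbf{x}' \in B_i$, as desired. ∎ 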
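
The key identity in the proof, namely that the conditional empirical distribution of the column player's action given that the row player played $i$ coincides with $\rho_n(\hat{B}_i)$, can be verified on a finite record of play. The Python sketch below does this for a hypothetical sequence of forecasts and plays; all names and numbers are illustrative assumptions, and ties are broken in favor of the smallest index.

```python
def best_response(loss_row, q):
    """Row action minimizing expected loss against forecast q; smallest index on ties."""
    values = [sum(q_j * loss_row[i][j] for j, q_j in enumerate(q))
              for i in range(len(loss_row))]
    return min(range(len(values)), key=lambda i: values[i])


def rho(forecasts, col_plays, in_set, M):
    """rho_n(A): average of the indicator vectors J_t over rounds whose forecast lies in A."""
    counts, hits = [0] * M, 0
    for q, j in zip(forecasts, col_plays):
        if in_set(q):
            hits += 1
            counts[j] += 1
    return [c / hits for c in counts] if hits else [0.0] * M


def conditional_distribution(row_plays, col_plays, i, M):
    """Empirical distribution of the column action over rounds where the row action is i."""
    counts = [0] * M
    for a, b in zip(row_plays, col_plays):
        if a == i:
            counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * M


# Hypothetical record of play.
loss_row = [[0.0, 1.0], [1.0, 0.0]]
forecasts = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]
col_plays = [0, 1, 1, 0, 1]
row_plays = [best_response(loss_row, q) for q in forecasts]

in_B0 = lambda q: best_response(loss_row, q) == 0   # membership in the set of forecasts answered by action 0
lhs = conditional_distribution(row_plays, col_plays, 0, 2)
rhs = rho(forecasts, col_plays, in_B0, 2)
```

The two vectors coincide because the row player plays $i$ exactly on the rounds where the forecast falls in $\hat{B}_i$.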


7.7 Blackwell’s Approachability Theorem 


We investigate a powerful generalization, introduced by Blackwell, of the problem of 
playing repeated two-player zero-sum games. Consider the situation of Section 7.3 with 
the only but essential difference that losses are vector valued. More precisely, the setup 
is described as follows. Just as before, at each time instance the row player selects an 
action $i \in \{1, \ldots, N\}$ and the column player selects an action $j \in \{1, \ldots, M\}$. However, 
the “loss” $\boldsymbol{\ell}(i, j)$ suffered by the row player is not a real number in [0, 1] but may take 
values in a bounded subset of $\mathbb{R}^m$. (We use bold characters to emphasize that losses are 
vector valued.) 

For the sake of concreteness we assume that all losses are in the euclidean unit ball, that 
is, the entries of the loss matrix $\boldsymbol{\ell}$ are such that $\|\boldsymbol{\ell}(i, j)\| \le 1$. 

In the simpler case of scalar losses, the purpose of the row player is to minimize his 
average loss regardless of the actions of the column player. Then von Neumann’s minimax 
theorem (together with martingale convergence, e.g., the Hoeffding—Azuma inequality) 
asserts that no matter how the column player plays, the row player can always keep the 
normalized accumulated loss $\frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t)$ in, or very close to, the set $(-\infty, V]$, but 
not in the set $(-\infty, V - \varepsilon]$ if $\varepsilon > 0$ (see Exercise 7.7). In the case of vector-valued losses 
the general question is to determine which subsets of $\mathbb{R}^m$ the row player can keep his 
average loss close to. To this end, following Blackwell [28], we introduce the notion of 
approachability: a subset S of the unit ball of $\mathbb{R}^m$ is approachable (by the row player) if the 
row player has a (randomized) strategy such that no matter how the column player plays, 

\[
\lim_{n\to\infty} d\left( \frac{1}{n}\sum_{t=1}^{n} \boldsymbol{\ell}(I_t, J_t),\ S \right) = 0 \quad \text{almost surely},
\]

where $d(\mathbf{u}, S) = \inf_{\mathbf{v} \in S} \|\mathbf{u} - \mathbf{v}\|$ denotes the euclidean distance of $\mathbf{u}$ from the set S. Because 
any set is approachable if and only if its closure is approachable, it suffices to consider only 
closed sets. 

Our purpose in this section is to characterize which convex sets are approachable. In the 
one-dimensional case the minimax theorem can be rephrased as follows: a closed interval 
$(-\infty, c]$ is approachable if and only if $c \ge V$. In other words, $(-\infty, c]$ is approachable if 
and only if the row player has a mixed strategy $\mathbf{p}$ such that $\max_{j=1,\ldots,M} \ell(\mathbf{p}, j) \le c$. 

In the general vector-valued case it is easy to characterize approachability of halfspaces. 
Consider a halfspace defined by $H = \{\mathbf{u} : \mathbf{a} \cdot \mathbf{u} \le c\}$, where $\|\mathbf{a}\| = 1$. If we define an 
auxiliary game with scalar losses $\ell(i, j) = \mathbf{a} \cdot \boldsymbol{\ell}(i, j)$, then clearly, H is approachable if 
and only if the set $(-\infty, c]$ is approachable in the auxiliary game. By the above-mentioned 
consequence of the minimax theorem, this happens if and only if the row player has a mixed strategy $\mathbf{p}$ with 
$\max_{j=1,\ldots,M} \ell(\mathbf{p}, j) = \max_{j=1,\ldots,M} \mathbf{a} \cdot \boldsymbol{\ell}(\mathbf{p}, j) \le c$. (Note that, just as earlier, $\boldsymbol{\ell}(\mathbf{p}, j) = \sum_{i} p_i\, \boldsymbol{\ell}(i, j)$.) Thus, we 
have proved the following characterization of the approachability of closed halfspaces. 


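Lemma 7.3. A halfspace $H = \{\mathbf{u} : \mathbf{a} \cdot \mathbf{u} \le c\}$ is approachable if and only if there exists a 
probability vector $\mathbf{p} = (p_1, \ldots, p_N)$ such that 

\[
\max_{j=1,\ldots,M} \mathbf{a} \cdot \boldsymbol{\ell}(\mathbf{p}, j) \le c.
\]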
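
The condition of Lemma 7.3 is easy to evaluate for a given mixed strategy. The Python sketch below computes $\max_j \mathbf{a} \cdot \boldsymbol{\ell}(\mathbf{p}, j)$ and, for the special case of $N = 2$ row actions, scans a grid of mixed strategies. The names, the grid search, and the toy loss matrix are assumptions made for illustration; an exact test for general $N$ would solve a linear program instead, and a grid can miss a feasible strategy lying strictly between grid points.

```python
def mixed_loss(loss, p, j):
    """Vector loss l(p, j) = sum_i p_i * l(i, j); loss[i][j] is a tuple in R^m."""
    m = len(loss[0][0])
    return [sum(p[i] * loss[i][j][d] for i in range(len(p))) for d in range(m)]


def worst_case_value(loss, p, a):
    """max over column actions j of a . l(p, j)."""
    M = len(loss[0])
    return max(sum(a_d * x_d for a_d, x_d in zip(a, mixed_loss(loss, p, j)))
               for j in range(M))


def halfspace_approachable_two_actions(loss, a, c, grid=1000, tol=1e-9):
    """Grid search over p = (w, 1 - w); a sketch for N = 2, not an exact procedure."""
    return any(worst_case_value(loss, (w / grid, 1 - w / grid), a) <= c + tol
               for w in range(grid + 1))


# Hypothetical game in R^2: row action 0 always yields (1, 0), row action 1 always yields (0, 1).
loss = [[(1.0, 0.0), (1.0, 0.0)],
        [(0.0, 1.0), (0.0, 1.0)]]
```

In this toy game the halfspace $\{\mathbf{u} : u_1 \le 0\}$ is approachable (always play the second action), while $\{\mathbf{u} : 0.6 u_1 + 0.8 u_2 \le 0.5\}$ is not, since every mixed strategy gives the value $0.6 p_1 + 0.8 p_2 \ge 0.6$.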


The lemma states that the halfspace H is approachable by the row player in a repeated 
play of the game if and only if in the one-shot game the row player has a mixed strategy 
that keeps the expected loss in the halfspace H. This is a simple and natural fact. What is 
interesting and much less obvious is that to approach any convex set S, it suffices that the 
row player has a strategy for each hyperplane not intersecting S to keep the average loss on 
the same side of the hyperplane as the set S. This is Blackwell’s approachability theorem, 
stated and proved next. 


Theorem 7.5 (Blackwell’s approachability theorem). A closed convex set S is approach- 
able if and only if every halfspace H containing S is approachable. 


It is important to point out that for approachability of S, every halfspace containing S 
must be approachable. Even if S can be written as an intersection of finitely many closed 
halfspaces, that is, $S = \bigcap_i H_i$, the fact that all $H_i$ are approachable does not imply that S 
is approachable: see Exercise 7.21. 

The proof of Theorem 7.5 is constructive and surprisingly simple. At every time instance, 
if the average loss is not in the set S, then the row player projects the average loss to S 
and uses the mixed strategy $\mathbf{p}$ corresponding to the halfspace containing S, defined by the 
hyperplane passing through the projected loss vector, and perpendicular to the direction of 
the projection. In the next section we generalize this “approaching” algorithm to obtain a 
whole family of strategies that guarantee that the average loss approaches S almost surely. 

Introduce the notation $\mathbf{A}_t = \frac{1}{t}\sum_{s=1}^{t} \boldsymbol{\ell}(I_s, J_s)$ for the average loss vector at time $t$ and 
denote by $\pi_S(\mathbf{u}) = \operatorname{argmin}_{\mathbf{v} \in S} \|\mathbf{u} - \mathbf{v}\|$ the projection of $\mathbf{u} \in \mathbb{R}^m$ onto S. (Note that $\pi_S(\mathbf{u})$ 
exists and is unique if S is closed and convex.) 


Proof. To prove the statement, note first that S is clearly not approachable if there exists 
a halfspace $H \supset S$ that is not approachable. 

Thus, it remains to show that if all halfspaces H containing S are approachable, then S 
is approachable as well. To this end, assume that the row player’s mixed strategy $\mathbf{p}_t$ at time 
$t = 1, 2, \ldots$ is arbitrary if $\mathbf{A}_{t-1} \in S$, and otherwise it is such that 

\[
\max_{j=1,\ldots,M} \mathbf{a}_{t-1} \cdot \boldsymbol{\ell}(\mathbf{p}_t, j) \le c_{t-1},
\]

where 

\[
\mathbf{a}_{t-1} = \frac{\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})}{\|\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\|} \quad \text{and} \quad c_{t-1} = \mathbf{a}_{t-1} \cdot \pi_S(\mathbf{A}_{t-1}).
\]

(Define $\mathbf{A}_0$ as the zero vector.) Observe that the hyperplane $\{\mathbf{u} : \mathbf{a}_{t-1} \cdot \mathbf{u} = c_{t-1}\}$ contains 
$\pi_S(\mathbf{A}_{t-1})$ and is perpendicular to the direction of projection of the average loss $\mathbf{A}_{t-1}$ to S. 
Since the halfspace $\{\mathbf{u} : \mathbf{a}_{t-1} \cdot \mathbf{u} \le c_{t-1}\}$ contains S, such a strategy $\mathbf{p}_t$ exists by assumption 
(see Figure 7.1). The defining inequality of $\mathbf{p}_t$ may be rewritten as 

\[
\max_{j=1,\ldots,M} \mathbf{a}_{t-1} \cdot \bigl( \boldsymbol{\ell}(\mathbf{p}_t, j) - \pi_S(\mathbf{A}_{t-1}) \bigr) \le 0.
\]

Figure 7.1. Approachability: the expected loss $\boldsymbol{\ell}(\mathbf{p}_t, J_t)$ is forced to stay on the same side of the 
hyperplane $\{\mathbf{u} : \mathbf{a}_{t-1} \cdot \mathbf{u} = c_{t-1}\}$ as the target set S, opposite to the current average loss $\mathbf{A}_{t-1}$. 


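Since $\mathbf{A}_t = \frac{t-1}{t}\mathbf{A}_{t-1} + \frac{1}{t}\boldsymbol{\ell}(I_t, J_t)$, we may write 

\begin{align*}
d(\mathbf{A}_t, S)^2 &= \|\mathbf{A}_t - \pi_S(\mathbf{A}_t)\|^2 \\
&\le \|\mathbf{A}_t - \pi_S(\mathbf{A}_{t-1})\|^2 \\
&= \left\| \frac{t-1}{t}\mathbf{A}_{t-1} + \frac{\boldsymbol{\ell}(I_t, J_t)}{t} - \pi_S(\mathbf{A}_{t-1}) \right\|^2 \\
&= \left\| \frac{t-1}{t}\bigl(\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\bigr) + \frac{1}{t}\bigl(\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\bigr) \right\|^2 \\
&= \left(\frac{t-1}{t}\right)^2 \|\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\|^2 + \frac{1}{t^2}\|\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\|^2 \\
&\quad + 2\,\frac{t-1}{t^2}\bigl(\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\bigr) \cdot \bigl(\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\bigr).
\end{align*}

Using the assumption that all losses, as well as the set S, are in the unit ball, and therefore 
$\|\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\| \le 2$, and rearranging the obtained inequality, we get 

\begin{align*}
& t^2 \|\mathbf{A}_t - \pi_S(\mathbf{A}_t)\|^2 - (t-1)^2 \|\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\|^2 \\
&\quad \le 4 + 2(t-1)\bigl(\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\bigr) \cdot \bigl(\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\bigr).
\end{align*}

Summing both sides of the inequality for $t = 1, \ldots, n$, the left-hand side telescopes 
and becomes $n^2 \|\mathbf{A}_n - \pi_S(\mathbf{A}_n)\|^2$. Next, dividing both sides by $n^2$, and writing $K_{t-1} = 
\frac{t-1}{n} \|\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\|$, we have 

\begin{align*}
\|\mathbf{A}_n - \pi_S(\mathbf{A}_n)\|^2 &\le \frac{4}{n} + \frac{2}{n}\sum_{t=1}^{n} K_{t-1}\, \mathbf{a}_{t-1} \cdot \bigl(\boldsymbol{\ell}(I_t, J_t) - \pi_S(\mathbf{A}_{t-1})\bigr) \\
&\le \frac{4}{n} + \frac{2}{n}\sum_{t=1}^{n} K_{t-1}\, \mathbf{a}_{t-1} \cdot \bigl(\boldsymbol{\ell}(I_t, J_t) - \boldsymbol{\ell}(\mathbf{p}_t, J_t)\bigr),
\end{align*}

where at the last step we used the defining property of $\mathbf{p}_t$. Since the random variable 
$K_{t-1}$ is bounded between 0 and 2, the second term on the right-hand side is an average of 
bounded zero-mean martingale differences, and therefore the Hoeffding–Azuma inequality 
(together with the Borel–Cantelli lemma) immediately implies that $\|\mathbf{A}_n - \pi_S(\mathbf{A}_n)\|^2 \to 0$ 
almost surely, which is precisely what we wanted to prove. ∎ 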
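
The geometric step of the proof, computing the normal vector $\mathbf{a}_{t-1}$ and the offset $c_{t-1}$ from the current average loss, can be written in a few lines. The Python sketch below takes the projection onto S as a user-supplied function; the function names and the choice of a Euclidean ball of radius 1/2 as the target set are assumptions made for illustration. Finding a mixed strategy that satisfies the resulting halfspace condition is a separate step that is not shown.

```python
import math


def project_ball(u, radius):
    """Euclidean projection onto the ball of the given radius centered at the origin."""
    norm = math.sqrt(sum(x * x for x in u))
    if norm <= radius:
        return list(u)
    return [x * radius / norm for x in u]


def blackwell_halfspace(avg, project):
    """Return (a, c) defining the halfspace {u : a . u <= c}, or None if avg already lies in S."""
    proj = project(avg)
    diff = [x - y for x, y in zip(avg, proj)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return None          # average loss is in S: any mixed strategy may be used
    a = [d / norm for d in diff]
    c = sum(a_d * p_d for a_d, p_d in zip(a, proj))
    return a, c


def update_average(avg, loss_vec, t):
    """A_t = ((t - 1) / t) * A_{t-1} + (1 / t) * l(I_t, J_t)."""
    return [((t - 1) * x + l) / t for x, l in zip(avg, loss_vec)]


# Hypothetical target set: the ball of radius 1/2 inside the unit ball of R^2.
project = lambda u: project_ball(u, 0.5)
a, c = blackwell_halfspace([1.0, 0.0], project)
```

For the average loss $(1, 0)$ the projection is $(1/2, 0)$, so the sketch returns $\mathbf{a} = (1, 0)$ and $c = 1/2$, and the halfspace $\{\mathbf{u} : u_1 \le 1/2\}$ indeed contains the ball.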


Remark 7.7 (Rates of convergence). The rate of convergence to 0 of the distance $\|\mathbf{A}_n - 
\pi_S(\mathbf{A}_n)\|$ that one immediately obtains from the proof is of order $n^{-1/4}$ (since the upper 
bound for the squared distance contains a sum of bounded martingale differences that, if 
bounded by the Hoeffding–Azuma inequality, gives a term that is $O_p(n^{-1/2})$). However, 
this bound can be improved substantially by a simple modification of the definition of 
the mixed strategy $\mathbf{p}_t$ used in the proof. Indeed, if instead of the average loss vector 
$\mathbf{A}_t = \frac{1}{t}\sum_{s=1}^{t} \boldsymbol{\ell}(I_s, J_s)$ one uses the “expected” average loss $\overline{\mathbf{A}}_t = \frac{1}{t}\sum_{s=1}^{t} \boldsymbol{\ell}(\mathbf{p}_s, J_s)$, one 
easily obtains a bound for $\|\mathbf{A}_n - \pi_S(\mathbf{A}_n)\|$ that is of order $n^{-1/2}$; see Exercise 7.23. 


Approachability and Regret Minimization 

To demonstrate the power of Theorem 7.5 we show how it can be used to show the existence 
of Hannan consistent forecasters. Recall the problem of randomized prediction described 
in Section 4.1. In this case the goal of the forecaster is to determine, at each round of play, a 
distribution $\mathbf{p}_t$ (i.e., a mixed strategy in the terminology of this chapter) so that regardless 
of what the outcomes $Y_t$ (the opponents’ play) are, the per-round regret 

\[
\frac{1}{n}\sum_{t=1}^{n} \ell(I_t, Y_t) - \min_{i=1,\ldots,N} \frac{1}{n}\sum_{t=1}^{n} \ell(i, Y_t)
\]

has a nonpositive limsup when each $I_t$ is drawn randomly according to the distribution $\mathbf{p}_t$. 
Equivalently, the forecaster tries to keep the per-round regret vector $\frac{1}{n}\mathbf{R}_n$, of components 
$\frac{1}{n}R_{i,n} = \frac{1}{n}\sum_{t=1}^{n} \bigl(\ell(I_t, Y_t) - \ell(i, Y_t)\bigr)$, close to the nonpositive orthant 

\[
S = \{\mathbf{u} = (u_1, \ldots, u_N) : \forall i = 1, \ldots, N,\ u_i \le 0\}.
\]

Thus, the existence of a Hannan consistent forecaster is equivalent to the approachability 
of the orthant S in a two-player game in which the vector-valued losses of the row player 
are defined by $\boldsymbol{\ell}(i, j)$ whose $k$th component is 

\[
\ell^{(k)}(i, j) = \ell(i, j) - \ell(k, j).
\]


(Note that with this choice the loss vectors fall in the ball, centered at the origin, of radius 
$\sqrt{N}$. In the formulation of Theorem 7.5 we assumed that the loss vectors take their 
values in the unit ball. The extension to the present case is a matter of trivial rescaling.) 

Figure 7.2. The normal vector $\mathbf{a}$ of any hyperplane corresponding to halfspaces containing the 
nonpositive orthant has nonnegative components. 

By Theorem 7.5 and Lemma 7.3, S is approachable if for every halfspace $\{\mathbf{u} : \mathbf{a} \cdot \mathbf{u} \le c\}$ 
containing S there exists a probability vector $\mathbf{p} = (p_1, \ldots, p_N)$ such that 

\[
\max_{j=1,\ldots,M} \mathbf{a} \cdot \boldsymbol{\ell}(\mathbf{p}, j) \le c.
\]

Clearly, it suffices to consider halfspaces of the form $\{\mathbf{u} : \mathbf{a} \cdot \mathbf{u} \le 0\}$, for which the condition reads 

\[
\max_{j=1,\ldots,M} \mathbf{a} \cdot \boldsymbol{\ell}(\mathbf{p}, j) \le 0,
\]

where the normal vector $\mathbf{a} = (a_1, \ldots, a_N)$ is such that all its components are nonnegative 
(see Figure 7.2). This condition is equivalent to requiring that, for all $j = 1, \ldots, M$, 


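\[
\sum_{k=1}^{N} a_k\, \ell^{(k)}(\mathbf{p}, j) = \sum_{k=1}^{N} a_k \bigl( \ell(\mathbf{p}, j) - \ell(k, j) \bigr) \le 0.
\]

Choosing 

\[
\mathbf{p} = \frac{\mathbf{a}}{\sum_{k=1}^{N} a_k}
\]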


the inequality clearly holds for all $j$ (with equality). Since $\mathbf{a}$ has only nonnegative components, 
$\mathbf{p}$ is indeed a valid mixed strategy. In summary, Blackwell’s approachability theorem 
indeed implies the existence of a Hannan consistent forecasting strategy. Note that the 
constructive proof of Theorem 7.5 defines such a strategy. Observe that this strategy is 
just the weighted average forecaster of Section 4.2 when the quadratic potential is used. 
In the next section we describe a whole family of strategies that, when specialized to 
the forecasting problem discussed here, reduce to weighted average forecasters with more 
general potential functions. The approachability theorem may be (and has been) used to 
handle more general problems. For example, it is easy to see that it implies the existence 
of internal-regret-minimizing strategies (see Exercise 7.22). 
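
For the orthant, the projection is the componentwise minimum with 0, so the normal vector is proportional to the positive part of the average regret vector and the resulting mixed strategy weights each action by its positive regret. The Python sketch below implements this rule in its "expected" variant (it tracks $\boldsymbol{\ell}(\mathbf{p}_t, Y_t)$ rather than a sampled $I_t$), which keeps the run deterministic; the function names, the uniform fallback when no regret is positive, and the toy loss matrix are assumptions made for illustration.

```python
import math


def blackwell_regret_weights(avg_regret):
    """Mixed strategy proportional to the positive part of the average regret vector."""
    positive = [max(r, 0.0) for r in avg_regret]
    total = sum(positive)
    if total == 0.0:
        # Average regret already in the orthant: any strategy is allowed (uniform chosen here).
        return [1.0 / len(avg_regret)] * len(avg_regret)
    return [x / total for x in positive]


def run_forecaster(loss, outcomes):
    """loss[i][y] in [0, 1]; returns the average expected regret vector after all rounds."""
    N = len(loss)
    avg = [0.0] * N
    for t, y in enumerate(outcomes, start=1):
        p = blackwell_regret_weights(avg)
        mixed = sum(p[i] * loss[i][y] for i in range(N))      # expected loss l(p_t, y)
        # Component k of the vector loss is l(p_t, y) - l(k, y).
        avg = [((t - 1) * r + (mixed - loss[k][y])) / t for k, r in enumerate(avg)]
    return avg


# Hypothetical prediction problem with two actions and two outcomes.
loss = [[0.0, 1.0], [1.0, 0.0]]
outcomes = [0, 1, 1, 0, 1] * 200
final_regret = run_forecaster(loss, outcomes)
```

Because the cross term in the squared-distance recursion vanishes for this choice of $\mathbf{p}_t$, the squared norm of the positive part of the cumulative expected regret grows by at most $N$ per round, so every component of the average expected regret is at most $\sqrt{N/n}$ after $n$ rounds.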


7.8 Potential-based Approachability 


The constructive proof of Blackwell’s approachability theorem (Theorem 7.5) shows that 
if S is an approachable convex set, then for any strategy of the row player such that 

\[
\max_{j=1,\ldots,M} \bigl(\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\bigr) \cdot \boldsymbol{\ell}(\mathbf{p}_t, j) \le \sup_{\mathbf{x} \in S}\, \bigl(\mathbf{A}_{t-1} - \pi_S(\mathbf{A}_{t-1})\bigr) \cdot \mathbf{x}
\]

the average loss $\mathbf{A}_n$ converges to the set S almost surely. The purpose of this section is to 
introduce, as proposed by Hart and Mas-Colell [146], a whole class of strategies for the 
row player that achieve the same goal. These strategies are generalizations of the strategy 
appearing in the proof of Theorem 7.5. 

The setup is the same as in the previous section, that is, we consider vector-valued 
losses in the unit ball, and the row player’s goal is to guarantee that, no matter what 
the opponent does, the average loss $\mathbf{A}_n = \frac{1}{n}\sum_{t=1}^{n} \boldsymbol{\ell}(I_t, J_t)$ converges to a set S almost 
surely. Theorem 7.5 shows that if S is convex, this is possible if and only if all linear 
halfspaces containing S are approachable. Throughout this section we assume that S is 
closed, convex, and approachable, and we define strategies for the row player guaranteeing 
that $d(\mathbf{A}_n, S) \to 0$ with probability 1. 

To define such a strategy, we introduce a nonnegative potential function $\Phi : \mathbb{R}^m \to \mathbb{R}$ 
whose role is to score the current situation of the average loss. A small value of $\Phi(\mathbf{A}_n)$ means 
that $\mathbf{A}_n$ is “close” to the target set S. All we assume about $\Phi$ is that it is convex, differentiable 
for all $\mathbf{x} \notin S$, and $\Phi(\mathbf{x}) = 0$ if and only if $\mathbf{x} \in S$. Note that such a potential always exists by 
convexity and closedness of S. One example is the function $\Phi(\mathbf{x}) = \inf_{\mathbf{y} \in S} \|\mathbf{x} - \mathbf{y}\|^2$. 

The row player uses the potential function to determine, at time $t$, a mixed strategy $\mathbf{p}_t$ 
satisfying, whenever $\mathbf{A}_{t-1} \notin S$, 

\[
\max_{j=1,\ldots,M} \mathbf{a}_{t-1} \cdot \boldsymbol{\ell}(\mathbf{p}_t, j) \le c_{t-1},
\]

where 

\[
\mathbf{a}_{t-1} = \frac{\nabla\Phi(\mathbf{A}_{t-1})}{\|\nabla\Phi(\mathbf{A}_{t-1})\|} \quad \text{and} \quad c_{t-1} = \sup_{\mathbf{x} \in S} \mathbf{a}_{t-1} \cdot \mathbf{x}.
\]

If $\mathbf{A}_{t-1} \in S$, the row player’s action can be arbitrary. Observe that the existence of such a $\mathbf{p}_t$ 
is implied by the fact that S is convex and approachable and by Theorem 7.5. Geometrically, 
the hyperplane tangent to the level curve of the potential function $\Phi$ passing through $\mathbf{A}_{t-1}$ 
is shifted so that it intersects S but S falls entirely on one side of the shifted hyperplane. 
The distribution $\mathbf{p}_t$ is determined so that the expected loss $\boldsymbol{\ell}(\mathbf{p}_t, j)$ is forced to stay on 
the same side of the hyperplane as S (see Figure 7.3). Note also that, in the special case 
when $\Phi(\mathbf{x}) = \inf_{\mathbf{y} \in S} \|\mathbf{x} - \mathbf{y}\|^2$, this strategy is identical to Blackwell’s strategy defined in 
the proof of Theorem 7.5, and therefore the potential-based strategy may be thought of as 
a generalization of Blackwell’s strategy. 
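
For the regret-minimization game, where S is the nonpositive orthant, the potential-based rule has an explicit form. The Python sketch below assumes the potential $\Phi_q(\mathbf{x}) = \sum_i \max(x_i, 0)^q$ with $q \ge 2$, which is an illustrative choice not taken from the text: it is convex, differentiable, and vanishes exactly on the orthant. Its gradient has nonnegative components, so $c_{t-1} = 0$ and the mixed strategy proportional to the gradient satisfies the defining inequality with equality. All function names are hypothetical.

```python
def potential_gradient(x, q=2.0):
    """Gradient of the assumed potential Phi_q(x) = sum_i max(x_i, 0) ** q, for q >= 2."""
    return [q * max(x_i, 0.0) ** (q - 1) for x_i in x]


def potential_based_weights(avg_regret, q=2.0):
    """Mixed strategy proportional to the gradient of the potential at the average regret."""
    grad = potential_gradient(avg_regret, q)
    total = sum(grad)
    if total == 0.0:
        # Average regret in the orthant: the strategy is arbitrary (uniform chosen here).
        return [1.0 / len(avg_regret)] * len(avg_regret)
    return [g / total for g in grad]


def halfspace_value(grad, p, loss_column):
    """grad . l(p, j), where component k of l(p, j) is l(p, j) - l(k, j)."""
    mixed = sum(p_i * l_i for p_i, l_i in zip(p, loss_column))
    return sum(g_k * (mixed - l_k) for g_k, l_k in zip(grad, loss_column))


avg_regret = [0.2, -0.1, 0.6]
weights_q2 = potential_based_weights(avg_regret, q=2.0)
weights_q3 = potential_based_weights(avg_regret, q=3.0)
```

With $q = 2$ the weights coincide with those of Blackwell's strategy for the orthant, while larger $q$ concentrates the weight on the actions with the largest positive regret, in the spirit of polynomially weighted average forecasters.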


Remark 7.8 (Bregman projection). For any $\mathbf{x} \notin S$ we may define 

\[
\pi_S(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y} \in S} \nabla\Phi(\mathbf{x}) \cdot \mathbf{y}.
\]

It is easy to see that the conditions on $\Phi$ imply that the maximum exists and is unique. In 
fact, since $\Phi(\mathbf{y}) = 0$ for all $\mathbf{y} \in S$, $\pi_S(\mathbf{x})$ may be rewritten as 

\[
\pi_S(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{y} \in S} \bigl( \Phi(\mathbf{y}) - \Phi(\mathbf{x}) - \nabla\Phi(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x}) \bigr) = \operatorname*{argmin}_{\mathbf{y} \in S} D_\Phi(\mathbf{y}, \mathbf{x}).
\]

In other words, $\pi_S(\mathbf{x})$ is just the projection, under the Bregman divergence, of $\mathbf{x}$ onto the 
set S. (For the definition and basic properties of Bregman divergences and projections, see 
Section 11.2.) 

Figure 7.3. Potential-based approachability. The hyperplane $\{\mathbf{u} : \mathbf{a}_{t-1} \cdot \mathbf{u} = c_{t-1}\}$ is determined by 
shifting the hyperplane tangent to the level set $\{\mathbf{x} : \Phi(\mathbf{x}) = \text{const.}\}$ containing $\mathbf{A}_{t-1}$ to the Bregman 
projection $\pi_S(\mathbf{A}_{t-1})$. The expected loss $\boldsymbol{\ell}(\mathbf{p}_t, J_t)$ is guaranteed to stay on the same side of the shifted 
hyperplane as S. 


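This potential-based strategy has a very similar version with comparable properties. This 
version simply replaces $\mathbf{A}_{t-1}$ in the definition of the algorithm by $\overline{\mathbf{A}}_{t-1}$ where, for each 
$t = 1, 2, \ldots$, $\overline{\mathbf{A}}_t = \frac{1}{t}\sum_{s=1}^{t} \boldsymbol{\ell}(\mathbf{p}_s, J_s)$ is the “expected” average loss; see also Exercise 7.23. 
Recall that 

\[
\boldsymbol{\ell}(\mathbf{p}_s, J_s) = \sum_{i=1}^{N} p_{i,s}\, \boldsymbol{\ell}(i, J_s) = \mathbb{E}\bigl[ \boldsymbol{\ell}(I_s, J_s) \mid I_1, \ldots, I_{s-1}, J_1, \ldots, J_s \bigr].
\]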


The main result of this section states that for any potential function, the row player’s average 
loss approaches the target set S no matter how the opponent plays. The next theorem states 
this fact for the modified strategy (based on $\overline{\mathbf{A}}_{t-1}$). The analogous statement for the strategy 
based on $\mathbf{A}_{t-1}$ may be proved similarly, and it is left as an exercise. 


Theorem 7.6. Let S be a closed, convex, and approachable subset of the unit ball in 
$\mathbb{R}^m$, and let $\Phi$ be a convex, twice differentiable potential function that vanishes in S and 
is positive outside of S. Denote the Hessian matrix of $\Phi$ at $\mathbf{x} \in \mathbb{R}^m$ by $H(\mathbf{x})$ and let 
$B = \sup_{\mathbf{x} : \|\mathbf{x}\| \le 1} \|H(\mathbf{x})\| < \infty$ be the maximal norm of the Hessian in the unit ball. For each 
$t = 1, 2, \ldots$ let the potential-based strategy $\mathbf{p}_t$ be any strategy satisfying 

\[
\max_{j=1,\ldots,M} \mathbf{a}_{t-1} \cdot \boldsymbol{\ell}(\mathbf{p}_t, j) \le c_{t-1} \quad \text{if } \overline{\mathbf{A}}_{t-1} \notin S,
\]

where 

\[
\mathbf{a}_{t-1} = \frac{\nabla\Phi(\overline{\mathbf{A}}_{t-1})}{\|\nabla\Phi(\overline{\mathbf{A}}_{t-1})\|} \quad \text{and} \quad c_{t-1} = \sup_{\mathbf{x} \in S} \mathbf{a}_{t-1} \cdot \mathbf{x} = \mathbf{a}_{t-1} \cdot \pi_S(\overline{\mathbf{A}}_{t-1}).
\]

(If $\overline{\mathbf{A}}_{t-1} \in S$, then the row player’s action can be arbitrary.) Then for any strategy of the 
column player, the average “expected” loss satisfies 

\[
\Phi(\overline{\mathbf{A}}_n) \le \frac{2B(\ln n + 1)}{n}.
\]

Also, for the average loss $\mathbf{A}_n = \frac{1}{n}\sum_{t=1}^{n} \boldsymbol{\ell}(I_t, J_t)$, we have 

\[
\lim_{n\to\infty} d(\mathbf{A}_n, S) = 0 \quad \text{with probability 1.}
\]


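Proof. First note that for any $x \notin S$ and $y \in S$,

\[ \nabla\Phi(x)\cdot(y - x) \le -\Phi(x). \]

This follows simply from the fact that by convexity of $\Phi$,

\[ 0 = \Phi(y) \ge \Phi(x) + \nabla\Phi(x)\cdot(y - x). \]

The rest of the proof is based on a simple Taylor expansion of the potential $\Phi(\bar A_t)$ around
$\Phi(\bar A_{t-1})$. Since $\bar A_t = \bar A_{t-1} + \frac{1}{t}\bigl(\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr)$, we have

\[
\begin{aligned}
\Phi(\bar A_t) &= \Phi(\bar A_{t-1}) + \frac{1}{t}\nabla\Phi(\bar A_{t-1})\cdot\bigl(\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr) \\
&\quad + \frac{1}{2t^2}\bigl(\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr)^{\top} H(\xi)\bigl(\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr) \\
&\qquad \text{(where $\xi$ is a vector between $\bar A_t$ and $\bar A_{t-1}$)} \\
&\le \Phi(\bar A_{t-1}) + \frac{1}{t}\nabla\Phi(\bar A_{t-1})\cdot\bigl(\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr)
 + \frac{1}{2t^2}\bigl\|\bar\ell(p_t, J_t) - \bar A_{t-1}\bigr\|^2 \sup_{\xi : \|\xi\| \le 1}\|H(\xi)\| \\
&\qquad \text{(by the Cauchy–Schwarz inequality)} \\
&\le \Phi(\bar A_{t-1}) + \frac{1}{t}\nabla\Phi(\bar A_{t-1})\cdot\bigl(\pi_S(\bar A_{t-1}) - \bar A_{t-1}\bigr) + \frac{2B}{t^2} \\
&\qquad \text{(by the defining property of $p_t$)} \\
&\le \Phi(\bar A_{t-1}) - \frac{1}{t}\Phi(\bar A_{t-1}) + \frac{2B}{t^2} \\
&\qquad \text{(by the property noted at the beginning of the proof).}
\end{aligned}
\]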

Multiplying both sides of the obtained inequality by t we get

\[ t\,\Phi(\bar A_t) \le (t - 1)\,\Phi(\bar A_{t-1}) + \frac{2B}{t} \qquad \text{for all } t \ge 1. \]

Summing this inequality for $t = 1, \ldots, n$, we have

\[ n\,\Phi(\bar A_n) \le \sum_{t=1}^{n} \frac{2B}{t}, \]

which, since $\sum_{t=1}^{n} 1/t \le \ln n + 1$, proves the first statement of the theorem. The almost sure
convergence of $A_n$ to S may now be obtained easily using the Hoeffding–Azuma inequality, just as in the
proof of Theorem 7.5. □
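
The telescoping step at the end of the proof can be checked numerically. The following is a minimal sketch, not part of the original argument: it iterates the recursion $t\,\Phi(\bar A_t) \le (t-1)\,\Phi(\bar A_{t-1}) + 2B/t$ with equality at every step (the worst case) and compares the result with the closed-form bound $2B(\ln n + 1)/n$. The function names and the value of B used in any call are illustrative assumptions.

```python
import math


def worst_case_potential(B, n):
    """Upper bounds on Phi(A_1), ..., Phi(A_n), obtained by iterating
    t * Phi_t <= (t - 1) * Phi_{t-1} + 2B / t with equality at every step."""
    bounds = []
    scaled = 0.0  # running value of t * (bound on the potential at step t)
    for t in range(1, n + 1):
        scaled += 2.0 * B / t
        bounds.append(scaled / t)
    return bounds


def stated_bound(B, n):
    """Closed-form bound 2B(ln n + 1)/n from the statement of the theorem."""
    return 2.0 * B * (math.log(n) + 1.0) / n
```

Because the harmonic sum never exceeds $\ln n + 1$, every entry returned by `worst_case_potential` stays below `stated_bound` for the same n.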


7.9 Convergence to Nash Equilibria 


We have seen in the previous sections that if all players follow simple strategies based on 
(internal) regret minimization, then the joint empirical frequencies of play approach the 
set of correlated equilibria at a remarkably fast rate. Here we investigate the considerably 
more difficult problem of achieving Nash equilibria. 

Just as before, we only consider uncoupled strategies; that is, each player knows his 
own payoff function but not that of the rest of the players. In this section we assume 
standard monitoring; that is, after taking an action, all players observe the other play- 
ers’ moves. In other words, the participants of the game know what the others do, but 
they do not know why they do it. In the next section we treat the more difficult case 
in which the players only observe their own losses but not the actions the other players 
take. 

Our main concern here is to describe strategies of play that guarantee that no mat- 
ter what the underlying game is, the players end up playing a mixed strategy profile 
close to a Nash equilibrium. The difficulty is that players cannot optimize their play 
because of the assumption of uncoupledness. The problem becomes more pronounced 
in the case of an “unknown game” (i.e., when players only observe their own payoffs 
but know nothing about what the other participants of the game do), treated in the next 
section. 

As mentioned in the introduction, in many cases, including two-person zero-sum 
games and K-person games in which every player has two actions (i.e., $N_k = 2$ for
all k), simple fictitious play is sufficient to achieve convergence to a Nash equilibrium 
(in a certain weak sense). However, the case of general games is considerably more 
difficult. 

To describe the first simple idea, assume that the game has a pure action Nash equilibrium,
that is, a Nash equilibrium $\pi$ concentrated on a single vector $i = (i_1, \ldots, i_K)$ of
actions. In this case it is easy to construct a randomized uncoupled strategy of repeated 
play such that if all players follow such a strategy, the joint strategy profiles converge to 
a Nash equilibrium almost surely. Perhaps the simplest such procedure is described as 
follows. 

AN UNCOUPLED STRATEGY TO FIND PURE NASH EQUILIBRIA

Strategy for player k.
For t = 1, 2, ...

(1) if t is odd, choose an action $I_t^{(k)}$ randomly, according to the uniform distribution
over $\{1, \ldots, N_k\}$;
(2) if t is even, let

\[ I_t^{(k)} = \begin{cases} 1 & \text{if } \ell^{(k)}(I_{t-1}) \le \min_{i_k = 1,\ldots,N_k} \ell^{(k)}(I_{t-1}^{-}, i_k), \\ 2 & \text{otherwise;} \end{cases} \]

(3) if t is even and all players have played action 1, then repeat forever the action
played in the last odd period.


In words, at each odd period the players choose an action randomly. At the next period 
they check if their action was a best response to the other players’ actions and communicate 
it to the other players by playing action 1 or 2. If all players have best responded at some 
point (which they confirm by playing action 1 in the next round), then they repeat the 
same action forever. This strategy clearly realizes a random exhaustive search. Because the 
game is finite, eventually, almost surely, by pure chance, at some odd time period a pure 
action Nash equilibrium $i = (i_1, \ldots, i_K)$ will be played (whose existence is guaranteed by
assumption). The expected time it takes to find such an equilibrium is at most $2\prod_{k=1}^{K} N_k$,
a quantity exponentially large in the number K of players. By the definition of Nash 
equilibrium all players best respond at the time i is played and therefore stay with this 
choice forever. 
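
The procedure can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the authors' code: a game is represented by one dictionary per player mapping action profiles (tuples of 0-based actions) to losses, and the "action 1 or 2" signal of the even periods is modeled as a list of flags rather than as actual plays. The names `is_best_response`, `find_pure_nash`, and `coordination` are hypothetical.

```python
import random


def is_best_response(loss_k, k, profile, n_k):
    """True if player k cannot lower its loss by deviating unilaterally."""
    current = loss_k[profile]
    for action in range(n_k):
        deviation = profile[:k] + (action,) + profile[k + 1:]
        if loss_k[deviation] < current:
            return False
    return True


def find_pure_nash(losses, n_actions, rng, max_periods=100_000):
    """Random exhaustive search: odd periods draw a uniform random profile,
    even periods let every player signal whether it was best responding."""
    n_players = len(n_actions)
    for t in range(1, max_periods + 1, 2):
        # Odd period t: every player picks an action uniformly at random.
        profile = tuple(rng.randrange(n) for n in n_actions)
        # Even period t + 1: signal 1 means "I was best responding".
        signals = [
            1 if is_best_response(losses[k], k, profile, n_actions[k]) else 2
            for k in range(n_players)
        ]
        if all(signal == 1 for signal in signals):
            return profile, t + 1  # this profile is repeated forever
    return None, max_periods


# Illustrative two-player coordination game: loss 0 on a match, 1 otherwise.
coordination = [
    {(a, b): 0.0 if a == b else 1.0 for a in range(2) for b in range(2)},
    {(a, b): 0.0 if a == b else 1.0 for a in range(2) for b in range(2)},
]
```

Calling `find_pure_nash(coordination, [2, 2], random.Random(0))` returns one of the two pure equilibria together with the even period at which all players signaled 1.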


Remark 7.9 (Exhaustive search). The algorithm shown above is a simple form of exhaus- 
tive search. In fact, all strategies that we describe here and in the next section are variants 
of the same principle. Of course, this also means that convergence is extremely slow in 
the sense that the whole space (exponentially large as a function of the number of players) 
needs to be explored. This is in sharp contrast to the internal-regret-minimizing strategies 
of Section 7.4 that guarantee rapid convergence to the set of correlated equilibria. 


Remark 7.10 (Nonstationarity). The learning strategy described above may not be very 
appealing because of its nonstationarity. Exercise 7.25 describes a stationary strategy pro- 
posed by Hart and Mas-Colell [149] that also achieves almost sure convergence to a pure 
action Nash equilibrium whenever such an equilibrium exists. 


Remark 7.11 (Multiple equilibria). The procedure always converges to a pure action Nash 
equilibrium. Of course, the game may have several Nash equilibria: some pure, others 
mixed. Some equilibria may be “better” than others, sometimes in quite a strong sense. 
However, the procedure does not make any distinction between different equilibria, and the 
one it finds is selected randomly, according to the uniform distribution over all pure action 
Nash equilibria. 

Remark 7.12 (Two players). It is important to note that the case of K = 2 players is
significantly simpler. In a two-player game (under a certain genericity assumption) it
suffices to use a strategy that repeats the previous play if it was a best response and selects
an action randomly otherwise (see Exercise 7.26). However, if the game is not generic or
in the case of at least three players, this procedure may enter a cycle and never converge
(Exercise 7.27).


Next we show how the ideas described above may be extended to general games that 
may not have a pure action Nash equilibrium. The simplest version of this extension 
does not quite achieve a Nash equilibrium but approximates it. To make the statement 
precise, we introduce the notion of an ε-Nash equilibrium. Let $\varepsilon > 0$. A mixed strategy
profile $\pi = p^{(1)} \times \cdots \times p^{(K)}$ is an ε-Nash equilibrium if for all $k = 1, \ldots, K$ and all mixed
strategies $q^{(k)}$,

\[ \pi\ell^{(k)} \le \pi'_k\ell^{(k)} + \varepsilon, \]

where $\pi'_k = p^{(1)} \times \cdots \times q^{(k)} \times \cdots \times p^{(K)}$ denotes the mixed strategy profile obtained by
replacing $p^{(k)}$ by $q^{(k)}$ and leaving all other players’ mixed strategies unchanged. Thus,
the definition of Nash equilibrium is modified simply by allowing some slack in the
defining inequalities. Clearly, it suffices to check the inequalities for mixed strategies $q^{(k)}$
concentrated on a single action $i_k$. The set of ε-Nash equilibria of a game is denoted by $\mathcal{N}_\varepsilon$.
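
As an illustration of the definition, the sketch below checks the ε-Nash inequalities for a given mixed strategy profile by testing only deviations to pure actions, which suffices as noted above. It is a hedged example: the dictionary representation of losses, the function names, and the matching-pennies game `pennies` are assumptions made for the illustration.

```python
import itertools


def expected_loss(loss_k, mixed_profile):
    """Expected loss of one player under the product distribution."""
    total = 0.0
    ranges = (range(len(p)) for p in mixed_profile)
    for actions in itertools.product(*ranges):
        prob = 1.0
        for p, a in zip(mixed_profile, actions):
            prob *= p[a]
        total += prob * loss_k[actions]
    return total


def is_eps_nash(losses, mixed_profile, eps):
    """Check the defining inequalities against every pure deviation."""
    for k, loss_k in enumerate(losses):
        base = expected_loss(loss_k, mixed_profile)
        n_k = len(mixed_profile[k])
        for action in range(n_k):
            pure = [1.0 if b == action else 0.0 for b in range(n_k)]
            deviated = list(mixed_profile)
            deviated[k] = pure
            if base > expected_loss(loss_k, deviated) + eps:
                return False
    return True


# Illustrative game (matching pennies): player 0 loses on a mismatch,
# player 1 loses on a match.
pennies = [
    {(a, b): 0.0 if a == b else 1.0 for a in range(2) for b in range(2)},
    {(a, b): 1.0 if a == b else 0.0 for a in range(2) for b in range(2)},
]
```

In this game the uniform profile passes the test for any small ε, whereas a profile in which player 0 plays its first action with probability 0.9 leaves player 1 with a regret of 0.4 and is therefore only an ε-Nash equilibrium for ε ≥ 0.4.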

The idea in extending the procedure discussed earlier for general games is that each 
player first selects a mixed strategy randomly and then checks whether it is an approximate 
best response to the others’ mixed strategies. Clearly, this cannot be done in just one round 
of play because players cannot observe the mixed strategies of the others. But if the same 
mixed strategy is played during sufficiently many periods, then each player can simply 
test the hypothesis whether his choice is an approximate best response. This procedure is 
formalized as follows. 


A REGRET-TESTING PROCEDURE TO FIND NASH EQUILIBRIA

Strategy for player k.
Parameters: Period length T, confidence parameter ρ > 0.

Initialization: Choose a mixed strategy $\pi_0^{(k)}$ randomly, according to the uniform
distribution over the simplex $D_k$ of probability distributions over $\{1, \ldots, N_k\}$;

For t = 1, 2, ...,

(1) if $t = mT + s$ for integers $m \ge 0$ and $1 \le s \le T - 1$, choose $I_t^{(k)}$ randomly
according to the mixed strategy $\pi_m^{(k)}$;
(2) if $t = mT$ for an integer $m \ge 1$, then let

\[ I_t^{(k)} = \begin{cases} 1 & \text{if } \dfrac{1}{T-1}\displaystyle\sum_{s=(m-1)T+1}^{mT-1} \ell^{(k)}(I_s) \le \min_{i_k = 1,\ldots,N_k} \dfrac{1}{T-1}\displaystyle\sum_{s=(m-1)T+1}^{mT-1} \ell^{(k)}(I_s^{-}, i_k) + \rho, \\ 2 & \text{otherwise;} \end{cases} \]

(3) if $t = mT$ and all players have played action 1, then play the mixed strategy
$\pi_{m-1}^{(k)}$ forever; otherwise, choose $\pi_m^{(k)}$ randomly, according to the uniform
distribution over $D_k$.
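
The boxed procedure can be simulated as follows. This is a rough sketch under assumptions rather than a reference implementation: losses are dictionaries keyed by 0-based action profiles, the uniform draw from the simplex is implemented by normalizing independent exponential variables, and the signaling round is replaced by a direct check that every player's estimated regret is at most ρ. All names are hypothetical.

```python
import random


def uniform_simplex(n, rng):
    """Uniform draw from the probability simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_action(p, rng):
    return rng.choices(range(len(p)), weights=p, k=1)[0]


def regret_testing(losses, n_actions, T, rho, rng, max_periods=10_000):
    """Redraw a random mixed profile every period until all players pass
    the regret test; return the accepted profile and the period index."""
    n_players = len(n_actions)
    for period in range(1, max_periods + 1):
        profile = [uniform_simplex(n, rng) for n in n_actions]
        realized = [0.0] * n_players
        counterfactual = [[0.0] * n for n in n_actions]
        for _ in range(T - 1):  # the last round of the period is the signal
            actions = tuple(sample_action(p, rng) for p in profile)
            for k in range(n_players):
                realized[k] += losses[k][actions]
                for a in range(n_actions[k]):
                    alt = actions[:k] + (a,) + actions[k + 1:]
                    counterfactual[k][a] += losses[k][alt]
        accepted = all(
            realized[k] / (T - 1) <= min(counterfactual[k]) / (T - 1) + rho
            for k in range(n_players)
        )
        if accepted:
            return profile, period
    return None, max_periods


# Illustrative game (matching pennies), an assumption of this sketch.
pennies = [
    {(a, b): 0.0 if a == b else 1.0 for a in range(2) for b in range(2)},
    {(a, b): 1.0 if a == b else 0.0 for a in range(2) for b in range(2)},
]
```

With a moderate period length such as `T=200` and `rho=0.2`, a call like `regret_testing(pennies, [2, 2], 200, 0.2, random.Random(1))` typically stops after a handful of periods at a profile close to the uniform one.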

In this procedure every player tests, after each period of length T, whether he has been
approximately best responding and sends a signal to the rest of the players at the end of the 
period. Once every player accepts the hypothesis of almost best response, the same mixed 
strategy profile is repeated forever. The next simple result summarizes the performance of 
this procedure. 


Theorem 7.7. Assume that every player of a game plays according to the regret-testing
procedure described above that is run with parameters T and $\rho \le 1/2$ such that
$(T - 1)\rho^2 \ge 2\ln\sum_{k=1}^{K} N_k$. With probability 1, there is a mixed strategy profile
$\pi = p^{(1)} \times \cdots \times p^{(K)}$ such that $\pi$ is played for all sufficiently large m. The probability
that $\pi$ is not a 2ρ-Nash equilibrium is at most

\[ 8T\rho^{2-d}\ln\bigl(T\rho^{2-d}\bigr)\,e^{-2(T-1)\rho^2}, \]

where $d = \sum_{k=1}^{K}(N_k - 1)$ is the number of free parameters needed to specify any mixed
strategy profile.


It is clear from the bound of the theorem that for an arbitrarily small value of ρ, if T is
sufficiently large, the probability of not ending up in a 2ρ-Nash equilibrium can be made
arbitrarily small. This probability decreases with T at a remarkably fast rate. Just note that
if $T\rho^2$ is significantly larger than $d \ln d$, the probability may be further bounded by $e^{-T\rho^2}$.
Thus, the size of the game plays a minor role in the bound of the probability. However, the
time the procedure takes to reach the near-equilibrium state depends heavily on the size of
the game. The estimates derived in the proof reveal that this stopping time is typically of
order $T\rho^{-d}$, exponentially large in the number of players. However, convergence to ε-Nash
equilibria cannot be achieved with this method for an arbitrarily small value of ε, because
no matter how T and ρ are chosen, there is always a positive, though tiny, probability
that all players accept a mixed strategy profile that is not an ε-Nash equilibrium. In the
next section we describe a different procedure that may be converted into an almost surely
convergent strategy.
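
To get a feeling for the orders of magnitude, the bound of Theorem 7.7 and the typical search time $T\rho^{-d}$ can be evaluated directly. The helper below is only a convenience sketch; its name, its argument conventions, and the decision to raise an error when the hypotheses of the theorem fail are assumptions of this illustration.

```python
import math


def failure_probability_bound(n_actions, T, rho):
    """Bound of Theorem 7.7 on the probability that the accepted profile
    is not a 2*rho-Nash equilibrium; n_actions lists N_1, ..., N_K."""
    if not 0.0 < rho <= 0.5:
        raise ValueError("the theorem requires 0 < rho <= 1/2")
    if (T - 1) * rho ** 2 < 2.0 * math.log(sum(n_actions)):
        raise ValueError("the theorem requires (T-1) rho^2 >= 2 ln(sum N_k)")
    d = sum(n - 1 for n in n_actions)  # free parameters of a mixed profile
    x = T * rho ** (2 - d)
    return 8.0 * x * math.log(x) * math.exp(-2.0 * (T - 1) * rho ** 2)


def typical_search_time(n_actions, T, rho):
    """Order of magnitude T * rho^(-d) of the stopping time."""
    d = sum(n - 1 for n in n_actions)
    return T * rho ** (-d)
```

For a two-player game with two actions each (d = 2), T = 2000 and ρ = 0.1, the probability bound is negligible, while the typical search time is already of the order of 200,000 rounds; adding players increases d and hence the search time exponentially.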


Proof of Theorem 7.7. Let M denote the (random) index for which every player plays
1 at time MT. We need to prove that M is finite almost surely. Consider the process
$\pi_0, \pi_1, \pi_2, \ldots$ of mixed strategy profiles found at times $0, T, 2T, \ldots$ by the players
using the procedure. For convenience, assume that players keep drawing a new random
mixed strategy $\pi_m$ even if $m \ge M$ and the mixed strategy is not used anymore. In other
words, $\pi_0, \pi_1, \pi_2, \ldots$ are independent random variables uniformly distributed over the
$d = \sum_{k=1}^{K}(N_k - 1)$-dimensional set obtained by the product of the K simplices of the
mixed strategies of the K players.
For any $\varepsilon \in (0, 1)$ we may write

\[
\mathbb{P}[M \ge m_0] = \sum_{j=0}^{m_0 - 1} \mathbb{P}\bigl[\pi_m \in \mathcal{N}_\varepsilon \ j \text{ times}\bigr]
\times \mathbb{P}\bigl[\pi_m \text{ is rejected for every } m < m_0 \bigm| \pi_m \in \mathcal{N}_\varepsilon \ j \text{ times}\bigr].
\]

We bound the terms on the right-hand side to derive the desired estimate for $\mathbb{P}[M \ge m_0]$.
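
First observe that since the game has at least one Nash equilibrium $\bar\pi = \bar p^{(1)} \times \cdots \times \bar p^{(K)}$,
any mixed strategy profile $\pi = p^{(1)} \times \cdots \times p^{(K)}$ such that

\[ \max_{k=1,\ldots,K}\ \max_{i_k=1,\ldots,N_k} \bigl|p^{(k)}(i_k) - \bar p^{(k)}(i_k)\bigr| \le \varepsilon \]

is an ε-Nash equilibrium. Thus, for every $m = 0, 1, \ldots$ we have

\[ \mathbb{P}[\pi_m \in \mathcal{N}_\varepsilon] \ge \varepsilon^d. \]

This implies that

\[ \mathbb{P}\bigl[\pi_m \in \mathcal{N}_\varepsilon \ j \text{ times}\bigr] = \mathbb{P}\bigl[\pi_m \notin \mathcal{N}_\varepsilon \ m_0 - j \text{ times}\bigr] \le \binom{m_0}{j}\bigl(1 - \varepsilon^d\bigr)^{m_0 - j}. \]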


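To bound the other term in this expression of $\mathbb{P}[M \ge m_0]$, note that if $\pi_m \in \mathcal{N}_\varepsilon$ exactly
j times for $m = 0, \ldots, m_0 - 1$, then $M \ge m_0$ implies that at least one player observes a
regret at least ρ at all of these j periods. Note that at any given period m, if $\varepsilon > \rho$, by using
the fact that the regret estimates are sums of $T - 1$ independent, identically distributed
random variables, Hoeffding’s inequality and the union-of-events bound imply that

\[
\mathbb{P}\bigl[M \ne m \bigm| \pi_m \in \mathcal{N}_\varepsilon\bigr]
\le \mathbb{P}\bigl[\exists k \le K : I_{mT}^{(k)} = 2 \bigm| \pi_m \in \mathcal{N}_\varepsilon\bigr]
\le \sum_{k=1}^{K} N_k\, e^{-2(T-1)(\varepsilon - \rho)^2}.
\]

Therefore,

\[
\mathbb{P}\bigl[\pi_m \text{ is rejected for every } m < m_0 \bigm| \pi_m \in \mathcal{N}_\varepsilon \ j \text{ times}\bigr]
\le \Bigl(\sum_{k=1}^{K} N_k\Bigr)^{j} e^{-2j(T-1)(\varepsilon - \rho)^2}.
\]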


Putting everything together, letting €= 2p, using the assumption (T — 1) o? > 
21n Si Nx, and fixing a jo < mo, we have 


K J 
P[M > mo] < mi eT moD y >D (>: m) eI Vp? 
J<jo J>jo \k=1 

< jome eo- +mo en lof -D02 
Since we are free to choose jọ, we take jọ = [mo roma Gk Inmo)|. To guarantee that 
Jo < mo we assume that mg > p (2 log po) . This implies that mo > p~4(Inmo)*, which 
in turn implies the desired condition jo < mo. Thus, using this choice of mo and after some 
simplification, we get 


\[ \mathbb{P}[M \ge m_0] \le 2e^{-\rho^d m_0/(2\ln m_0)}. \tag{7.2} \]


But then the Borel—Cantelli lemma implies that M is finite almost surely. 

Now let $\pi$ denote the mixed strategy profile that the process ends up repeating forever.
(This random variable is well defined with probability 1 according to the fact that $M < \infty$
almost surely.) Then the probability that $\pi$ is not a 2ρ-Nash equilibrium may be bounded
by writing

\[ \mathbb{P}[\pi \notin \mathcal{N}_{2\rho}] = \sum_{m=1}^{\infty} \mathbb{P}[M = m,\ \pi \notin \mathcal{N}_{2\rho}]. \]

For $m \ge m_0$ (where the value of $m_0$ is determined later) we simply use the bound
$\mathbb{P}[M = m,\ \pi \notin \mathcal{N}_{2\rho}] \le \mathbb{P}[M = m]$. On the other hand, for any m, by Hoeffding’s inequality,

\[ \mathbb{P}[M = m,\ \pi \notin \mathcal{N}_{2\rho}] \le \mathbb{P}\bigl[M = m \bigm| \pi_m \notin \mathcal{N}_{2\rho}\bigr] \le e^{-2(T-1)\rho^2}. \]

Thus, for any $m_0$, we get

\[ \mathbb{P}[\pi \notin \mathcal{N}_{2\rho}] \le m_0\, e^{-2(T-1)\rho^2} + \mathbb{P}[M \ge m_0]. \]


Here we may use the estimate (7.2), which is valid for all mo > poe (2 log pty. A 
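convenient choice is to take $m_0$ to be the greatest integer such that $m_0/(2\ln m_0) < T\rho^{2-d}$.
Then $m_0 \le 4T\rho^{2-d}\ln\bigl(T\rho^{2-d}\bigr)$, and we obtain

\[ \mathbb{P}[\pi \notin \mathcal{N}_{2\rho}] \le 8T\rho^{2-d}\ln\bigl(T\rho^{2-d}\bigr)\,e^{-2(T-1)\rho^2} \]

as stated. □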


7.10 Convergence in Unknown Games 


The purpose of this section is to show that even in the much more restricted framework 
of “unknown” games described in Section 7.5, it is possible to achieve, asymptotically, 
an approximate Nash equilibrium. Recall that in this model, as the players play a game 
repeatedly, they not only do not know the other players’ loss function but they do not even 
know their own, and the only information they receive is their loss suffered at each round 
of the game, after taking an action. 

The basic idea of the strategy we investigate in this section is somewhat reminiscent of 
the regret-testing procedure described in the previous section. On a quick inspection of the 
procedure it is clear that the assumption of standard monitoring (i.e., that players are able to 
observe their opponents’ actions) is used twice in the definition of the procedure. On the one 
hand, the players should be able to compute their estimated regret in each period of length 
T . On the other hand, they need to communicate to the others whether their estimated regret 
is smaller than a certain threshold or not. It turns out that it is easy to find a solution to the 
first problem, because players may easily and reliably estimate their regret even if they do 
not observe others’ actions. This should not come as a surprise after having seen that regret 
minimization is possible in the bandit problem (see Sections 6.7, 6.8, and 7.5). In fact, the 
situation here is significantly simpler. However, the lack of ability of communication with 
the other players poses more serious problems because it is difficult to come up with a 
stopping rule as in the procedure shown in the previous section. 

The solution is a procedure in which, just as before, time is divided into periods of length 
T, all players keep testing their regret in each period, and they stay with their previously 
chosen mixed strategy if they have had a satisfactorily small estimated regret. Otherwise 
they choose a new mixed strategy randomly, just like above. To make the ideas more 
transparent, first we ignore the issue of estimating regrets in the model of unknown game 
and assume that, after every period $(m - 1)T + 1, \ldots, mT$, each player $k = 1, \ldots, K$ can
compute the regrets

\[ r^{(k)}_{m,i_k} = \frac{1}{T}\sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s) - \frac{1}{T}\sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s^{-}, i_k) \]

for all $i_k = 1, \ldots, N_k$. Now we may define a regret-testing procedure as follows.


EXPERIMENTAL REGRET TESTING

Strategy for player k.

Parameters: Period length T, confidence parameter ρ > 0, exploration parameter
λ > 0.

Initialization: Choose a mixed strategy $\pi_0^{(k)}$ randomly, according to the uniform
distribution over the simplex $D_k$ of probability distributions over $\{1, \ldots, N_k\}$;

For t = 1, 2, ...

(1) if $t = mT + s$ for integers $m \ge 0$ and $1 \le s \le T$, choose $I_t^{(k)}$ randomly
according to the mixed strategy $\pi_m^{(k)}$;

(2) if $t = mT$ for an integer $m \ge 1$, then
  • if $\max_{i_k = 1,\ldots,N_k} r^{(k)}_{m,i_k} > \rho$,
    then choose $\pi_m^{(k)}$ randomly according to the uniform distribution over $D_k$;
  • otherwise, with probability $1 - \lambda$ let $\pi_m^{(k)} = \pi_{m-1}^{(k)}$ and with probability λ
    choose $\pi_m^{(k)}$ randomly according to the uniform distribution over $D_k$.


The parameters ρ and T play a similar role to that played in the previous section. The
introduction of the exploration parameter λ > 0 is technical; it is needed for the proofs
given below but it is unclear whether the natural choice λ = 0 would give similar results.
With λ > 0, even if a player has all regrets below the threshold ρ, the player will reserve
a positive probability of exploration. A strictly positive value of λ guarantees that the
sequence $\{\pi_m\}$ of mixed strategy profiles forms a rapidly mixing Markov process. In fact,
the first basic lemma establishes this property. For convenience we denote the set $\prod_{k=1}^{K} D_k$
of all mixed strategy profiles (i.e., product distributions) by $\Sigma$. Also, we introduce the
notation $N = \sum_{k=1}^{K} N_k$.
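
The following sketch simulates experimental regret testing in the simplified setting considered so far, in which each player can compute the regrets $r^{(k)}_{m,i_k}$ exactly from the observed plays. It is an illustration under assumptions: losses are dictionaries keyed by 0-based action profiles with values in [0, 1], the simplex is sampled by normalizing exponential variables, and all function names are hypothetical.

```python
import random


def uniform_simplex(n, rng):
    """Uniform draw from the probability simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_action(p, rng):
    return rng.choices(range(len(p)), weights=p, k=1)[0]


def experimental_regret_testing(losses, n_actions, T, rho, lam, n_periods, rng):
    """Return the list of mixed strategy profiles pi_0, ..., pi_{n_periods}."""
    n_players = len(n_actions)
    profile = [uniform_simplex(n, rng) for n in n_actions]
    history = [list(profile)]
    for _period in range(n_periods):
        realized = [0.0] * n_players
        counterfactual = [[0.0] * n for n in n_actions]
        for _round in range(T):
            actions = tuple(sample_action(p, rng) for p in profile)
            for k in range(n_players):
                realized[k] += losses[k][actions]
                for a in range(n_actions[k]):
                    alt = actions[:k] + (a,) + actions[k + 1:]
                    counterfactual[k][a] += losses[k][alt]
        new_profile = []
        for k in range(n_players):
            max_regret = max((realized[k] - c) / T for c in counterfactual[k])
            # Resample on a failed test, or with probability lam (exploration).
            if max_regret > rho or rng.random() < lam:
                new_profile.append(uniform_simplex(n_actions[k], rng))
            else:
                new_profile.append(profile[k])
        profile = new_profile
        history.append(list(profile))
    return history


# Illustrative game (matching pennies), an assumption of this sketch.
pennies = [
    {(a, b): 0.0 if a == b else 1.0 for a in range(2) for b in range(2)},
    {(a, b): 1.0 if a == b else 0.0 for a in range(2) for b in range(2)},
]
```

Note that each player updates its own mixed strategy independently of the others, so no communication is needed. Setting `lam=0.0` and `rho=1.0` freezes the initial profile, which is a convenient sanity check of the update rule.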


Lemma 7.4. The stochastic process $\{\pi_m\}$, $m = 0, 1, 2, \ldots$, defined by experimental regret
testing with $0 < \lambda < 1$, is a homogeneous, recurrent, and irreducible Markov chain
satisfying Doeblin’s condition. In particular, for any measurable set $A \subset \Sigma$,

\[ P(\pi \to A) \ge \lambda^K \mu(A) \]

for every $\pi \in \Sigma$, where $P(\pi \to A) = \mathbb{P}[\pi_{m+1} \in A \mid \pi_m = \pi]$ denotes the transition
probabilities of the Markov chain, $\mu$ denotes the uniform distribution on $\Sigma$, and m is any
nonnegative integer.

Proof. To see that the process is a Markov chain, note that at each $m = 0, 1, 2, \ldots$,
$\pi_m$ depends only on $\pi_{m-1}$ and the regrets $r^{(k)}_{m,i_k}$ ($i_k = 1, \ldots, N_k$, $k = 1, \ldots, K$). It is
irreducible since at each $0, T, 2T, \ldots$, the probability of reaching some $\pi'_m \in A$ for any open
set $A \subset \Sigma$ from any $\pi_{m-1} \in \Sigma$ is strictly positive when λ > 0, and it is recurrent since
$\mathbb{E}\bigl[\sum_{m=0}^{\infty} \mathbb{I}_{\{\pi_m \in A\}} \bigm| \pi_0 \in A\bigr] = \infty$ for all $\pi_0 \in A$. Doeblin’s condition follows simply from
the presence of the exploration parameter λ in the definition of experimental regret testing.
In particular, with probability $\lambda^K$ every player chooses a mixed strategy randomly, and
conditioned on this event the distribution of $\pi_m$ is uniform. □


The lemma implies that $\{\pi_m\}$ is a rapidly mixing Markov chain. The behavior of such
Markov processes is well understood (see, e.g., the monograph of Meyn and Tweedie [217] 
for an excellent coverage). The properties we use subsequently are summarized in the 
following corollary. 


Corollary 7.2. For $m = 0, 1, 2, \ldots$ let $P_m$ denote the distribution of the mixed strategy
profile $\pi_m = (\pi_m^{(1)}, \ldots, \pi_m^{(K)})$ chosen by the players at time mT, that is,
$P_m(A) = \mathbb{P}[\pi_m \in A]$. Then there exists a unique probability distribution Q over $\Sigma$ (the
stationary distribution of the Markov process) such that

\[ \sup_{A} |P_m(A) - Q(A)| \le \bigl(1 - \lambda^K\bigr)^m, \]

where the supremum is taken over all measurable sets $A \subset \Sigma$ (see [217, Theorem 16.2.4]).
Also, the ergodic theorem for Markov chains implies that

\[ \lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} \pi_m = \int_{\Sigma} \pi\, dQ(\pi) \qquad \text{almost surely.} \]
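
The geometric rate in the corollary translates into a concrete number of periods. The sketch below computes the smallest m for which $(1 - \lambda^K)^m$ drops below a target δ; the function name and the parameter `delta` are illustrative assumptions.

```python
import math


def periods_until_mixed(lam, n_players, delta):
    """Smallest integer m with (1 - lam**n_players)**m <= delta."""
    if not (0.0 < lam < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("expected 0 < lam < 1 and 0 < delta < 1")
    contraction = 1.0 - lam ** n_players
    m = max(0, math.ceil(math.log(delta) / math.log(contraction)))
    # Guard against floating-point rounding in the logarithms.
    while contraction ** m > delta:
        m += 1
    while m > 0 and contraction ** (m - 1) <= delta:
        m -= 1
    return m
```

Because the contraction factor is $1 - \lambda^K$, the required number of periods grows roughly like $\lambda^{-K}\ln(1/\delta)$, so small exploration probabilities and many players make mixing slow.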


The main idea behind the regret-testing heuristics is that, after a not very long search period,
by pure chance, the mixed strategy profile $\pi_m$ will be an ε-Nash equilibrium, and then,
since all players have a small expected regret, the process gets stuck with this value for
a much longer time than the search period. The main technical result needed to justify
such a statement is summarized in Lemma 7.5. This implies that if the parameters of the
procedure are set appropriately, the length of the search period is negligible compared with
the length of time the process spends in an ε-Nash equilibrium. The proof of Lemma 7.5 is
quite technical and is beyond the scope of this book. See the bibliographic remarks for the
appropriate pointers. In addition, the proof requires certain properties of the game that are
not satisfied by all games. However, the necessary conditions hold for almost all games, in
the sense that the Lebesgue measure of all those games that do not satisfy these conditions
is 0. (Here we consider the representation of a game as the $K\prod_{k=1}^{K} N_k$-dimensional vector of
all losses $\ell^{(k)}(i)$.) Let $\overline{\mathcal{N}}_\varepsilon = \Sigma \setminus \mathcal{N}_\varepsilon$ denote the complement of the set of ε-Nash equilibria.


Lemma 7.5. For almost all K-person games there exist positive constants $c_1, c_2$ such that,
for all sufficiently small ρ > 0, the K-step transition probabilities of experimental regret
testing satisfy

\[ P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr) \ge c_1\rho^{c_2}, \]

where we use the notation $P^{(K)}(A \to B) = \mathbb{P}[\pi_{m+K} \in B \mid \pi_m \in A]$ for the K-step
transition probabilities.


On the basis of this lemma, we can now state one of the basic properties of the experimental 
regret-testing procedure. The result states that, in the long run, the played mixed strategy 
profile is not an approximate Nash equilibrium at a tiny fraction of time. 


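Theorem 7.8. Almost all games are such that there exists a positive number $\varepsilon_0$ and positive
constants $c_1, \ldots, c_4$ such that for all $\varepsilon < \varepsilon_0$ if the experimental regret-testing procedure is
used with parameters

\[ \rho \in \bigl(\varepsilon,\ \varepsilon + \varepsilon^{c_1}\bigr), \qquad \lambda \le c_2\,\varepsilon^{c_3}, \qquad\text{and}\qquad T \ge -\frac{1}{2(\rho - \varepsilon)^2}\log\bigl(c_4\,\varepsilon^{c_3}\bigr), \]

then for all $M \ge \log(\varepsilon/2)/\log(1 - \lambda^K)$,

\[ P_M\bigl(\overline{\mathcal{N}}_\varepsilon\bigr) = \mathbb{P}[\pi_M \notin \mathcal{N}_\varepsilon] \le \varepsilon. \]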


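Proof. First note that by Corollary 7.2,

\[ P_M\bigl(\overline{\mathcal{N}}_\varepsilon\bigr) \le Q\bigl(\overline{\mathcal{N}}_\varepsilon\bigr) + \bigl(1 - \lambda^K\bigr)^M, \]

so that it suffices to bound the measure of $\overline{\mathcal{N}}_\varepsilon$ under the stationary probability Q. To this
end, first observe that, by the defining property of the stationary distribution,

\[ Q(\mathcal{N}_\rho) = Q(\mathcal{N}_\rho)\,P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) + Q\bigl(\overline{\mathcal{N}}_\rho\bigr)\,P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr). \]

Solving for $Q(\mathcal{N}_\rho)$ gives

\[ Q(\mathcal{N}_\rho) = \frac{P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr)}{1 - P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) + P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr)}. \tag{7.3} \]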


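To derive a lower bound for the expression on the right-hand side, we write the elementary
inequality

\[
\begin{aligned}
P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho)
&= \frac{Q(\mathcal{N}_\varepsilon)\,P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho) + Q(\mathcal{N}_\rho \setminus \mathcal{N}_\varepsilon)\,P^{(K)}(\mathcal{N}_\rho \setminus \mathcal{N}_\varepsilon \to \mathcal{N}_\rho)}{Q(\mathcal{N}_\rho)} \\
&\ge \frac{Q(\mathcal{N}_\varepsilon)\,P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)}{Q(\mathcal{N}_\rho)}.
\end{aligned}
\tag{7.4}
\]

To bound $P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)$, note that if $\pi_m \in \mathcal{N}_\varepsilon$, then the expected regret of all players
is at most ε. Since the regret estimates $r^{(k)}_{m+1,i_k}$ are sums of T independent random variables
taking values between 0 and 1 with mean at most ε, Hoeffding’s inequality implies that

\[ \mathbb{P}\bigl[r^{(k)}_{m+1,i_k} \ge \rho\bigr] \le e^{-2T(\rho - \varepsilon)^2}, \qquad i_k = 1, \ldots, N_k, \quad k = 1, \ldots, K. \]

Then the probability that there is at least one player k and a strategy $i_k \le N_k$ such that
$r^{(k)}_{m+1,i_k} \ge \rho$ is bounded by $\sum_{k=1}^{K} N_k e^{-2T(\rho - \varepsilon)^2} = N e^{-2T(\rho - \varepsilon)^2}$. Thus, with probability at least
$(1 - \lambda)^K\bigl(1 - N e^{-2T(\rho - \varepsilon)^2}\bigr)$, all players keep playing the same mixed strategy, and therefore

\[ P(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon) \ge (1 - \lambda)^K\bigl(1 - N e^{-2T(\rho - \varepsilon)^2}\bigr). \]

Consequently, since ρ > ε, we have $P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho) \ge P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon)$ and hence

\[
\begin{aligned}
P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho) &\ge P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon) \ge P(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon)^K \\
&\ge (1 - \lambda)^{K^2}\bigl(1 - N e^{-2T(\rho - \varepsilon)^2}\bigr)^K \ge 1 - K^2\lambda - NK e^{-2T(\rho - \varepsilon)^2}
\end{aligned}
\]

(where we assumed that $\lambda \le 1$ and $N e^{-2T(\rho - \varepsilon)^2} \le 1$). Thus, using (7.4) and the obtained
estimate, we have

\[ P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) \ge \frac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)}\bigl(1 - K^2\lambda - NK e^{-2T(\rho - \varepsilon)^2}\bigr). \]

Next we need to show that, for proper choice of the parameters, $P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr)$ is
sufficiently large. For almost all K-person games, this follows from Lemma 7.5, which
asserts that

\[ P^{(K)}\bigl(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho\bigr) \ge C_1\rho^{C_2} \]

for some positive constants $C_1$ and $C_2$ that depend on the game. Hence, from (7.3) we
obtain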


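\[ Q(\mathcal{N}_\rho) \ge \frac{C_1\rho^{C_2}}{1 - \bigl(1 - K^2\lambda - NK e^{-2T(\rho - \varepsilon)^2}\bigr)\dfrac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)} + C_1\rho^{C_2}}. \]

It remains to estimate the measure $Q(\mathcal{N}_\varepsilon)/Q(\mathcal{N}_\rho)$. We need to show that the ratio is close
to 1 whenever $\rho - \varepsilon \ll \varepsilon$. It turns out that one can show that, in fact, for almost every game
there exists a constant $C_5$ such that

\[ \frac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)} \ge 1 - \frac{C_3(\rho - \varepsilon)^{C_4}}{\rho^{C_5}}, \]

where $C_3$ and $C_4$ are positive constants depending on the game. This inequality is not
surprising, but the rigorous proof of this statement is somewhat technical and is skipped
here. It may be found in [126]. Summarizing,

\[
\begin{aligned}
Q(\mathcal{N}_\varepsilon) &\ge Q(\mathcal{N}_\rho)\Bigl(1 - \frac{C_3(\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\Bigr) \\
&\ge \Bigl(1 - \frac{C_3(\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\Bigr)
\times \frac{C_1\rho^{C_2}}{1 - \bigl(1 - K^2\lambda - NK e^{-2T(\rho - \varepsilon)^2}\bigr)\Bigl(1 - \dfrac{C_3(\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\Bigr) + C_1\rho^{C_2}}
\end{aligned}
\]

for some positive constants $C_1, \ldots, C_5$. Choosing the parameters ρ, λ, T with appropriate
constants $c_1, \ldots, c_4$, we have

\[ Q\bigl(\overline{\mathcal{N}}_\varepsilon\bigr) \le \varepsilon/2. \]

If M is so large that $(1 - \lambda^K)^M \le \varepsilon/2$, we have $P_M\bigl(\overline{\mathcal{N}}_\varepsilon\bigr) \le \varepsilon$, as desired. □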

Theorem 7.8 states that if the parameters of the experimental regret-testing procedure are
set in an appropriate way, the mixed strategy profiles will be in an approximate equilibrium
most of the time. However, it is important to realize that the theorem does not claim
convergence in any way. In fact, if the parameters T, ρ, and λ are kept fixed forever, the
process will periodically abandon the set of ε-Nash equilibria and wander around for a
long time before it gets stuck in a (possibly different) ε-Nash equilibrium. Then the process
stays there for an even much longer time before leaving again. However, since the process
$\{\pi_m\}$ forms an ergodic Markov chain, it is easy to deduce convergence of the empirical
frequencies of play. Specifically, we show next that if all players play according to the
experimental regret-testing procedure, then the joint empirical frequencies of play converge
almost surely to a joint distribution P that is in the convex hull of ε-Nash equilibria. The
precise statement is given in Theorem 7.9.

Recall that for each $t = 1, 2, \ldots$ we denote by $I_t^{(k)}$ the pure strategy played by the
kth player. $I_t^{(k)}$ is drawn randomly according to the mixed strategy $\pi_m^{(k)}$ whenever
$t \in \{mT + 1, \ldots, (m + 1)T\}$. Consider the joint empirical distribution of plays $\hat P_t$ defined by

\[ \hat P_t(i) = \frac{1}{t}\sum_{s=1}^{t} \mathbb{I}_{\{I_s = i\}}, \qquad i \in \prod_{k=1}^{K}\{1, \ldots, N_k\}. \]

Denote the convex hull of a set A by co(A). 


Remark 7.13. Recall that Nash equilibria and ε-Nash equilibria $\pi$ are mixed strategy
profiles, that is, product distributions, and have been considered, up to this point, as elements
of the set $\Sigma$ of product distributions. However, a product distribution is a special joint
distribution over the set $\prod_{k=1}^{K}\{1, \ldots, N_k\}$ of pure strategy profiles, and it is this “larger”
space in which the convex hull of ε-Nash equilibria is defined. Thus, elements of the convex
hull are typically not product distributions. (Recall that the convex hull of Nash equilibria
is a subset of the set of correlated equilibria.)
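
A two-player example makes the last point of the remark concrete. The sketch below, which uses made-up helper names, mixes two product distributions with equal weights and checks that the result no longer factorizes into its marginals.

```python
def product_distribution(p, q):
    """Joint distribution of two independent players with marginals p and q."""
    return [[pi * qj for qj in q] for pi in p]


def mix(joint_a, joint_b, weight):
    """Convex combination weight * joint_a + (1 - weight) * joint_b."""
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(joint_a, joint_b)
    ]


def is_product(joint, tol=1e-12):
    """True if the joint distribution equals the product of its marginals."""
    rows = [sum(row) for row in joint]
    cols = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[i][j] - rows[i] * cols[j]) <= tol
        for i in range(len(rows))
        for j in range(len(cols))
    )


# Two pure strategy profiles, each a (degenerate) product distribution.
both_first = product_distribution([1.0, 0.0], [1.0, 0.0])
both_second = product_distribution([0.0, 1.0], [0.0, 1.0])
mixture = mix(both_first, both_second, 0.5)
```

Here `mixture` puts mass 1/2 on each of the two diagonal profiles, while the product of its marginals would put mass 1/4 on every profile, so the mixture lies in the convex hull but not in $\Sigma$.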


Theorem 7.9. For almost every game and for every sufficiently small ε > 0, there exists a
choice of the parameters (T, ρ, λ) such that the following holds: there is a joint distribution
P over the set of K-tuples $i = (i_1, \ldots, i_K)$ of actions in the convex hull $\mathrm{co}(\mathcal{N}_\varepsilon)$ of the set of
ε-Nash equilibria such that the joint empirical frequencies of play of experimental regret
testing satisfy

\[ \lim_{t\to\infty} \hat P_t = P \qquad \text{almost surely.} \]


Proof. If $\pi = \pi^{(1)} \times \cdots \times \pi^{(K)} \in \Sigma$ is a product distribution, introduce the notation

\[ P(\pi, i) = \prod_{k=1}^{K} \pi^{(k)}(i_k), \]

where $i = (i_1, \ldots, i_K)$. In other words, $P(\pi, \cdot)$ is a joint distribution over the set of action
profiles i, induced by $\pi$.

First observe that, since at time t the vector $I_t$ of actions is chosen according to the
mixed strategy profile $\pi_{\lfloor t/T \rfloor}$, by martingale convergence, for every i,

\[ \hat P_t(i) - \frac{1}{t}\sum_{s=1}^{t} P\bigl(\pi_{\lfloor s/T \rfloor}, i\bigr) \to 0 \qquad \text{almost surely.} \]

Therefore, it suffices to prove convergence of $\frac{1}{t}\sum_{s=1}^{t} P(\pi_{\lfloor s/T \rfloor}, i)$. Since $\pi_{\lfloor s/T \rfloor}$ is
unchanged during periods of length T, we obviously have

\[ \lim_{t\to\infty} \frac{1}{t}\sum_{s=1}^{t} P\bigl(\pi_{\lfloor s/T \rfloor}, i\bigr) = \lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} P(\pi_m, i). \]

By Corollary 7.2,

\[ \lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} \pi_m = \bar\pi \qquad \text{almost surely,} \]

where $\bar\pi = \int_{\Sigma} \pi\, dQ(\pi)$. (Recall that Q is the unique stationary distribution of the Markov
process.) This, in turn, implies by continuity of $P(\pi, i)$ in $\pi$ that there exists a joint
distribution $P(i) = \int_{\Sigma} P(\pi, i)\, dQ(\pi)$ such that, for all i,

\[ \lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} P(\pi_m, i) = P(i) \qquad \text{almost surely.} \]

It remains to show that $P \in \mathrm{co}(\mathcal{N}_\varepsilon)$.
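Let $\varepsilon' < \varepsilon$ be a positive number such that the $\varepsilon'$ blowup of $\mathrm{co}(\mathcal{N}_{\varepsilon'})$ is contained in $\mathrm{co}(\mathcal{N}_\varepsilon)$,
that is,

\[ \bigl\{P \in \Sigma : \exists P' \in \mathrm{co}(\mathcal{N}_{\varepsilon'}) \text{ such that } \|P - P'\|_1 < \varepsilon'\bigr\} \subset \mathrm{co}(\mathcal{N}_\varepsilon). \]

Such an $\varepsilon'$ always exists for almost all games by Exercise 7.28. In fact, one may choose
$\varepsilon' = \varepsilon/c_3$ for a sufficiently large positive constant $c_3$ (whose value depends on the game).
Now choose the parameters (T, ρ, λ) such that $Q\bigl(\overline{\mathcal{N}}_{\varepsilon'}\bigr) < \varepsilon'$. Theorem 7.8 guarantees
the existence of such a choice.
Clearly,

\[ P(i) = \int_{\Sigma} P(\pi, i)\, dQ(\pi) = \int_{\mathcal{N}_{\varepsilon'}} P(\pi, i)\, dQ(\pi) + \int_{\overline{\mathcal{N}}_{\varepsilon'}} P(\pi, i)\, dQ(\pi). \]

Since $\int_{\mathcal{N}_{\varepsilon'}} P(\pi, \cdot)\, dQ(\pi) \in \mathrm{co}(\mathcal{N}_{\varepsilon'})$, we find that the $L_1$ distance of P and $\mathrm{co}(\mathcal{N}_{\varepsilon'})$ satisfies

\[ d_1\bigl(P, \mathrm{co}(\mathcal{N}_{\varepsilon'})\bigr) \le \Bigl\| \int_{\overline{\mathcal{N}}_{\varepsilon'}} P(\pi, \cdot)\, dQ(\pi) \Bigr\|_1 \le \int_{\overline{\mathcal{N}}_{\varepsilon'}} dQ(\pi) = Q\bigl(\overline{\mathcal{N}}_{\varepsilon'}\bigr) < \varepsilon'. \]

By the choice of $\varepsilon'$ we indeed have $P \in \mathrm{co}(\mathcal{N}_\varepsilon)$. □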


Remark 7.14 (Convergence of the mixed strategy profiles). We only mention briefly that
the experimental regret testing procedure can be extended to obtain an uncoupled strategy
such that the mixed strategy profiles converge, with probability 1, to the set of Nash
equilibria for almost all games. Note that we claim convergence not only of the empirical
frequencies of plays but also of the actual mixed strategy profiles. Moreover, we claim
convergence to $\mathcal{N}$ and to the convex hull $\mathrm{co}(\mathcal{N}_\varepsilon)$ of all ε-Nash equilibria for a fixed ε. The
basic idea is to “anneal” experimental regret testing such that first it is used with some
parameters $(T_1, \rho_1, \lambda_1)$ for a number $M_1$ of periods of length $T_1$ and then the parameters are
changed to $(T_2, \rho_2, \lambda_2)$ (by increasing T and decreasing ρ and λ properly) and experimental
regret testing is used for a number $M_2 \gg M_1$ of periods (of length $T_2$), and so on. However,
this is not sufficient to guarantee almost sure convergence, because at each change of
parameters the process is reinitialized and therefore there is an infinite set of indices t such
that the mixed strategy profile played at time t is far away from any Nash equilibrium. A
possible solution is based on “localizing” the search after each change of parameters such
that each player limits its choice to a small neighborhood of the mixed strategy played right
before the change of parameters (unless a player experiences a large regret, in which case
the search is extended again to the whole simplex). Another challenge one must face in
designing a genuinely uncoupled procedure is that the values of the parameters of the
procedure (i.e., $T_\ell$, $\rho_\ell$, $\lambda_\ell$, and $M_\ell$, $\ell = 1, 2, \ldots$) cannot depend on the parameters of the
game, because by requiring uncoupledness we must assume that the players only know
their payoff function but not those of the other players.
We leave the details as an exercise.


Remark 7.15 (Nongeneric games). All results of this section up to this point hold for 
almost every game. The reason for this restriction is that our proofs require an assumption 
of genericity of the game. We do not know whether Theorems 7.8 and 7.9 extend to all 
games. However, by a simple trick one can modify experimental regret testing such that 
the results of these two theorems hold for all games. The idea is that before starting to 
play, each player slightly perturbs the values of his loss function and then plays as if his
losses were the perturbed values. For example, define, for each player $k$ and pure strategy
profile $\mathbf{i}$,

\[
\tilde\ell^{(k)}(\mathbf{i}) = \ell^{(k)}(\mathbf{i}) + Z_{\mathbf{i},k},
\]

where the $Z_{\mathbf{i},k}$ are i.i.d. random variables uniformly distributed in the interval $[-\varepsilon, \varepsilon]$.
Clearly, the perturbed game is generic with probability 1. Therefore, if all players play
according to experimental regret testing but on the basis of the perturbed losses, then
Theorems 7.8 and 7.9 are valid for this newly generated game. However, because for all $k$
and $\mathbf{i}$ we have $|\tilde\ell^{(k)}(\mathbf{i}) - \ell^{(k)}(\mathbf{i})| \le \varepsilon$, every $\varepsilon$-Nash equilibrium of the perturbed game is a
$2\varepsilon$-Nash equilibrium of the original game.
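
The perturbation trick of Remark 7.15 can be illustrated with a minimal sketch. The dictionary layout, the function name `perturb_losses`, and the numerical loss values below are illustrative assumptions, not part of the text; the only property taken from the remark is that each entry is shifted by an independent uniform draw from the interval of half-width eps.

```python
import random

def perturb_losses(losses, eps, rng):
    """Add an independent Uniform[-eps, eps] draw to every entry of one player's loss table.

    `losses` maps a pure strategy profile (a tuple of actions) to a loss value.
    """
    return {profile: loss + rng.uniform(-eps, eps) for profile, loss in losses.items()}

# Hypothetical 2x2 loss table of one player, indexed by the profile (i_1, i_2).
losses = {(1, 1): 1 / 3, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 2 / 3}
eps = 0.01
rng = random.Random(0)  # fixed seed only to make the sketch reproducible

perturbed = perturb_losses(losses, eps, rng)

# Largest deviation between perturbed and original losses; by construction it is at most eps.
max_gap = max(abs(perturbed[p] - losses[p]) for p in losses)
```

Each player would then run the procedure on `perturbed` instead of `losses`; since no entry moves by more than `eps`, an eps-Nash equilibrium of the perturbed table is a 2·eps-Nash equilibrium of the original one.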


Finally, we show how experimental regret testing can be modified so that it can be played 
in the model of unknown games with similar performance guarantees. In order to adjust 
the procedure, recall that the only place in which the players look at the past is when they 
calculate the regrets 


\[
r_{m,i_k}^{(k)} = \frac{1}{T} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(\mathbf{I}_s)
 - \frac{1}{T} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(\mathbf{I}_s^-, i_k).
\]

However, each player may estimate his regret in a simple way. Observe that the first term in
the definition of $r_{m,i_k}^{(k)}$ is just the average loss of player $k$ over the $m$th period, which is available
to the player and does not need to be estimated. The second term, however, is the average
loss the player would have suffered had he played action $i_k$ all the time during this period.
This can be estimated by random sampling. The idea is that, at each time instant, player
$k$ flips a biased coin and, if the outcome is heads (the probability of which is very small),
then instead of choosing an action according to the mixed strategy $\pi_m^{(k)}$ the player chooses
one uniformly at random. At these time instants, the player collects sufficient information
to estimate the regret with respect to each fixed action $i_k$.

To formalize this idea, consider a period between times $(m-1)T+1$ and $mT$. During
this period, player $k$ draws $n_k$ samples for each of the actions $i_k = 1, \ldots, N_k$, where $n_k \ll T$
is to be determined later. Formally, define the random variables $U_{k,s} \in \{0, 1, \ldots, N_k\}$,
where, for $s$ between $(m-1)T+1$ and $mT$, for each $i_k = 1, \ldots, N_k$, there are exactly
$n_k$ values of $s$ such that $U_{k,s} = i_k$, and all such configurations are equally probable; for
the remaining $s$, $U_{k,s} = 0$. (In other words, for each $i_k = 1, \ldots, N_k$, $n_k$ values of $s$ are
chosen randomly, without replacement, such that these values are disjoint for different
$i_k$’s.) Then, at time $s$, player $k$ draws an action $I_s^{(k)}$ as follows: conditionally on the past up
to time $s-1$,

\[
I_s^{(k)} \;
\begin{cases}
\text{is distributed as } \pi_m^{(k)} & \text{if } U_{k,s} = 0, \\
\text{equals } i_k & \text{if } U_{k,s} = i_k.
\end{cases}
\]


The regret $r_{m,i_k}^{(k)}$ may be estimated by

\[
\hat r_{m,i_k}^{(k)} = \frac{1}{T - N_k n_k} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(\mathbf{I}_s)\, \mathbb{I}_{\{U_{k,s}=0\}}
 - \frac{1}{n_k} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(\mathbf{I}_s^-, i_k)\, \mathbb{I}_{\{U_{k,s}=i_k\}},
\qquad i_k = 1, \ldots, N_k.
\]

The first term of the definition of $\hat r_{m,i_k}^{(k)}$ is just the average of the losses
of player $k$ over those periods in which the player does not “experiment,” that is, when
$U_{k,s} = 0$. (Note that there are exactly $T - N_k n_k$ such periods.) Since $N_k n_k \ll T$, this
average should be close to the first term in the definition of the average regret $r_{m,i_k}^{(k)}$. The
second term is the average over those time periods in which player $k$ experiments and
plays action $i_k$ (i.e., when $U_{k,s} = i_k$). This may be considered as an estimate, obtained
by sampling without replacement, of the second term in the definition of $r_{m,i_k}^{(k)}$. Observe
that $\hat r_{m,i_k}^{(k)}$ only depends on the past payoffs experienced by player $k$, and therefore these
estimates are feasible in the unknown game model.
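
The sampling scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the zero-based time index, and the toy environment (action 1 always costs 0.5, action 2 always costs 0.2, and the player's mixed strategy always picks action 1) are hypothetical choices made only so the result is easy to check by hand.

```python
import random

def sampling_schedule(T, num_actions, n_k, rng):
    """Return U[0..T-1] with values in {0, 1, ..., num_actions}.

    Exactly n_k time slots are assigned to each action, chosen without replacement
    and disjoint across actions; all remaining slots get 0 (no experiment).
    """
    slots = rng.sample(range(T), num_actions * n_k)
    U = [0] * T
    for idx, s in enumerate(slots):
        U[s] = idx // n_k + 1
    return U

def estimated_regrets(own_losses, U, num_actions, n_k):
    """Estimate the regret against each fixed action from the player's own payoffs only.

    own_losses[s] is the loss actually suffered at slot s; on an experiment slot
    (U[s] = i) this is the loss of action i against the others' play.
    """
    T = len(U)
    base = sum(l for l, u in zip(own_losses, U) if u == 0) / (T - num_actions * n_k)
    return {
        i: base - sum(l for l, u in zip(own_losses, U) if u == i) / n_k
        for i in range(1, num_actions + 1)
    }

rng = random.Random(1)
T, num_actions, n_k = 1000, 2, 10
U = sampling_schedule(T, num_actions, n_k, rng)

# Toy environment (an assumption): action 1 costs 0.5, action 2 costs 0.2,
# and the non-experimenting player always plays action 1.
own_losses = [0.2 if u == 2 else 0.5 for u in U]
regrets = estimated_regrets(own_losses, U, num_actions, n_k)
```

In this toy setting the estimate for action 1 is 0 and the estimate for action 2 is 0.3, matching the true regrets, because the losses do not vary over time.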

In order to show that the estimated regrets work in this case, we only need to establish
that the probability that the estimated regret exceeds $\rho$ is small if the expected regret is not
more than $\varepsilon$ (whenever $\varepsilon < \rho$). This is done in the following lemma. It guarantees that if
the experimental regret-testing procedure is run using the regret estimates described above,
then results analogous to Theorems 7.8 and 7.9 may be obtained, in a straightforward way,
in the unknown-game model.


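Lemma 7.6. Assume that in a certain period of length $T$, the expected regret
$\mathbb{E}\bigl[r_{m,i_k}^{(k)} \mid \mathbf{I}_1, \ldots, \mathbf{I}_{(m-1)T}\bigr]$ of player $k$ is at most $\varepsilon$. Then, for a sufficiently small $\varepsilon$, with the
choice of parameters of Theorem 7.8,

\[
\mathbb{P}\bigl[\hat r_{m,i_k}^{(k)} \ge \rho\bigr] \le c\,T^{-1/3} + \exp\bigl(-T^{1/3} (\rho - \varepsilon)^2\bigr).
\]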


Proof. We show that, with large probability, $\hat r_{m,i_k}^{(k)}$ is close to $r_{m,i_k}^{(k)}$. To this end, first we
compare the first terms in the expression of both. Observe that at those periods $s$ of time
when none of the players experiments (i.e., when $U_{k,s} = 0$ for all $k = 1, \ldots, K$), the
corresponding terms of both estimates are equal. Thus, by simple algebra it is easy to see
that the first terms differ by at most $\frac{2}{T} \sum_{j=1}^{K} N_j n_j$.

It remains to compare the second terms in the expressions of $\hat r_{m,i_k}^{(k)}$ and $r_{m,i_k}^{(k)}$. Observe
that if there is no time instant $s$ for which $U_{k,s} \ge 1$ and $U_{k',s} \ge 1$ for some $k' \ne k$, then

\[
\frac{1}{n_k} \sum_{s=t+1}^{t+T} \ell^{(k)}(\mathbf{I}_s^-, i_k)\, \mathbb{I}_{\{U_{k,s}=i_k\}}
\]

is an unbiased estimate of

\[
\frac{1}{T} \sum_{s=t+1}^{t+T} \ell^{(k)}(\mathbf{I}_s^-, i_k)
\]

obtained by random sampling. The probability that some two players sample at the same time
is at most

\[
T K^2 \max_{k,k' \le K} \frac{N_k n_k}{T}\, \frac{N_{k'} n_{k'}}{T},
\]

where we used the union-of-events bound over all pairs of players and all $T$ time instants. By
Hoeffding’s inequality for an average of a sample taken without replacement (see Lemma
A.2), we have

\[
\mathbf{P}\left[ \frac{1}{n_k} \sum_{s=t+1}^{t+T} \ell^{(k)}(\mathbf{I}_s^-, i_k)\, \mathbb{I}_{\{U_{k,s}=i_k\}}
 - \frac{1}{T} \sum_{s=t+1}^{t+T} \ell^{(k)}(\mathbf{I}_s^-, i_k) > \alpha \right] \le e^{-2 n_k \alpha^2},
\]

where $\mathbf{P}$ denotes the distribution induced by the random variables $U_{k,s}$. Putting everything
together,

\[
\mathbb{P}\bigl[\hat r_{m,i_k}^{(k)} \ge \rho\bigr]
 \le T K^2 \max_{k,k' \le K} \frac{N_k n_k}{T}\, \frac{N_{k'} n_{k'}}{T}
 + \exp\left( -2 n_k \left( \rho - \varepsilon - \frac{2}{T} \sum_{j=1}^{K} N_j n_j \right)^{2} \right).
\]

Choosing $n_k \sim T^{1/3}$, the first term on the right-hand side is of order $T^{-1/3}$ and
$\frac{1}{T} \sum_{j=1}^{K} N_j n_j = O(T^{-2/3})$ becomes negligible compared with $\rho - \varepsilon$. $\blacksquare$
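
The orders of magnitude quoted at the end of the proof can be checked by direct substitution. The following worked computation is a sketch; the shorthand $N = \max_k N_k$ and the exact choice $n_k = T^{1/3}$ for every player are assumptions introduced here for illustration.

```latex
% Assumptions (not in the text): N = \max_k N_k and n_k = T^{1/3} for all k.
\[
T K^2 \max_{k,k' \le K} \frac{N_k n_k}{T}\,\frac{N_{k'} n_{k'}}{T}
 \;\le\; T K^2\, \frac{N^2 T^{2/3}}{T^2}
 \;=\; K^2 N^2\, T^{-1/3},
\qquad
\frac{2}{T} \sum_{j=1}^{K} N_j n_j
 \;\le\; \frac{2 K N\, T^{1/3}}{T}
 \;=\; 2 K N\, T^{-2/3}.
\]
```

So the collision term decays like $T^{-1/3}$, while the correction inside the exponent decays like $T^{-2/3}$ and is eventually dominated by any fixed gap $\rho - \varepsilon > 0$.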
Regret-minimizing strategies, such as those discussed in Sections 4.2 and 4.3, set up the
goal of predicting as well as the best constant strategy in hindsight, assuming that the 
actions of the opponents would have been the same had the forecaster been following that 
constant strategy. However, when a forecasting strategy is used to play a repeated game, 
the actions prescribed by the forecasting strategy may have an effect on the behavior of the 
opponents, and so measuring regret as the difference of the suffered cumulative loss and 
that of the best constant action in hindsight may be very misleading. 
To simplify the setup of the problem, we consider playing a two-player game such that,
at time $t$, the row player takes an action $I_t \in \{1, \ldots, N\}$ and the column player takes action
$J_t \in \{1, \ldots, M\}$. The loss suffered by the row player at time $t$ is $\ell(I_t, J_t)$. (The loss of the
column player is immaterial in this section. Note also that since we are only concerned with
the loss of the first player, there is no loss of generality in assuming that there are only two
players, since otherwise $J_t$ can represent the joint play of all other players.) In the language
of Chapter 4, we consider the case of a nonoblivious opponent; that is, the actions of the
column player (the opponent) may depend on the history $I_1, \ldots, I_{t-1}$ of past moves of the
row player.

To illustrate why regret-minimizing strategies may fail miserably in such a scenario, 
consider the repeated play of a prisoners’ dilemma, that is, a 2 x 2 game in which the loss 
matrix of the row player is given by 


R\C |  c    d
 c  | 1/3   1
 d  |  0   2/3


In the usual definition of the prisoners’ dilemma, the column player has the same loss 
matrix as the row player. In this game both players can either cooperate (“c”) or defect 
(“d”). Regardless of what the column player does, the row player is better off defecting 
(and the same goes for the column player). However, it is better for the players if they both 
cooperate than if they both defect. 

Now assume that the game is played repeatedly and the row player plays according
to a Hannan-consistent strategy; that is, the normalized cumulative loss $\frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t)$
approaches $\min_{i=c,d} \frac{1}{n} \sum_{t=1}^{n} \ell(i, J_t)$. Clearly, the minimum is achieved by action “d” and
therefore the row player will defect basically all the time. In a certain worst-case sense
this may be the best one can hope for. However, in many realistic situations, depending
on the behavior of the adversary, significantly smaller losses can be achieved. For example,
the column player may be willing to try cooperation. Perhaps the simplest such strategy
of the opponent is “tit for tat,” in which the opponent repeats the row player’s previous
action. In such a case, by playing a Hannan-consistent strategy, the row player’s performance
is much worse than what he could have achieved by following the expert “c” (which is the
worse action in the sense of the notions of regret we have used so far).
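
The trap can be observed in a small simulation. The sketch below is illustrative only: the loss values (1/3, 1, 0, 2/3) are assumed from the prisoners’ dilemma example, the opponent is assumed to open with “c”, and the exponentially weighted average forecaster with a fixed learning rate stands in for a generic Hannan-consistent strategy. The function names are hypothetical.

```python
import math
import random

# Row player's losses, keyed by (row action, column action); values assumed from the example.
LOSS = {("c", "c"): 1 / 3, ("c", "d"): 1.0, ("d", "c"): 0.0, ("d", "d"): 2 / 3}

def constant_vs_tit_for_tat(action, n):
    """Average loss of always playing `action` against tit for tat (opponent opens with 'c')."""
    opponent, total = "c", 0.0
    for _ in range(n):
        total += LOSS[(action, opponent)]
        opponent = action  # tit for tat repeats the row player's previous action
    return total / n

def exp_weights_vs_tit_for_tat(n, seed=0):
    """Average loss of an exponentially weighted average forecaster against tit for tat."""
    rng = random.Random(seed)
    eta = math.sqrt(8 * math.log(2) / n)  # standard tuning for two actions and horizon n
    cum = {"c": 0.0, "d": 0.0}            # cumulative losses of the two constant actions
    opponent, total = "c", 0.0
    for _ in range(n):
        m = min(cum.values())             # shift for numerical stability
        w = {a: math.exp(-eta * (cum[a] - m)) for a in cum}
        action = "c" if rng.random() < w["c"] / (w["c"] + w["d"]) else "d"
        total += LOSS[(action, opponent)]
        for a in cum:                     # regret bookkeeping treats the outcome as fixed
            cum[a] += LOSS[(a, opponent)]
        opponent = action
    return total / n

always_c = constant_vs_tit_for_tat("c", 1000)
always_d = constant_vs_tit_for_tat("d", 1000)
hannan = exp_weights_vs_tit_for_tat(10000)
```

Always cooperating yields an average loss of 1/3, whereas the weighted average forecaster quickly concentrates on “d” and ends up close to the always-defect average of about 2/3.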

The purpose of this section is to introduce forecasting strategies that avoid falling in traps 
similar to the one described above under certain assumptions on the opponent’s behavior. 

To this end, consider the scenario where, rather than requiring Hannan consistency, 
the goal of the forecaster is to achieve a cumulative loss (almost) as small as that of the 
best action, where the cumulative loss of each action is calculated by looking at what 
would have happened if that action had been followed throughout the whole repeated 
game. 

It is obvious that a completely malicious adversary can make it impossible to estimate 
what would have happened if a certain action had been played all the time (unless that 
action is played all the time). But under certain natural assumptions on the behavior of 
the adversary, such an inference is possible. The assumptions under which our goal can 
be reached require a kind of “stationarity” and bounded memory of the opponent and are 
certainly satisfied for simple strategies such as tit for tat. 
Remark 7.16 (Hannan consistent strategies are sometimes better). We have argued that in
some cases it makes more sense to look for strategies that perform as well as the best action
if that action had been played all the time rather than playing Hannan consistent strategies.
The repeated prisoners’ dilemma with the adversary playing tit for tat is a clear example.
However, in some other cases Hannan consistent strategies may perform much better than
the best action in this new sense. The following example describes such a situation: assume
that the row player has $N = 2$ actions, and let $n$ be even such that $n/2$ is odd. Assume
that in the first $n/2$ time periods the losses of both actions are 1 in each period. After $n/2$
periods the adversary decides to assign losses $00\ldots000$ ($n/2$ times) to the action that was
played fewer times during the first $n/2$ rounds and $11\ldots111$ to the other action. Clearly, a
Hannan consistent strategy has a cumulative loss of about $n/2$ during the $n$ periods of the
game. On the other hand, if any of the two actions is played constantly, its cumulative
loss is $n$.


The goal of this section is to design strategies that guarantee, under certain assumptions on
the behavior of the column player, that the average loss $\frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t)$ is not much larger
than $\min_{i=1,\ldots,N} \mu_{i,n}$, where $\mu_{i,n}$ is the average loss of a hypothetical player who plays the
same action $I_t = i$ in each round of the game.

A key ingredient of the argument is a different way of measuring regret. The goal of
the forecaster in this new setup is to achieve, during the $n$ periods of play, an average loss
almost as small as the average loss of the best action, where the average is computed over
only those periods in which the action was chosen by the forecaster. To make the definition
formal, denote by

\[
\hat\mu_t = \frac{1}{t} \sum_{s=1}^{t} \ell(I_s, J_s)
\]

the averaged cumulative loss of the forecaster at time $t$ and by

\[
\hat\mu_{i,t} = \frac{\sum_{s=1}^{t} \ell(I_s, J_s)\, \mathbb{I}_{\{I_s=i\}}}{\sum_{s=1}^{t} \mathbb{I}_{\{I_s=i\}}}
\]

the averaged cumulative loss of action $i$, averaged over the time periods in which the action
was played by the forecaster. If $\sum_{s=1}^{t} \mathbb{I}_{\{I_s=i\}} = 0$, let $\hat\mu_{i,t}$ take the maximal value 1. At this
point it may not be entirely clear how the averaged losses $\hat\mu_{i,t}$ are related to the average
loss of a player who plays the same action $i$ all the time. However, shortly it will become
clear that these quantities can be related under some assumptions on the behavior of the
opponent and certain restrictions on the forecasting strategy.

The property that the forecaster needs to satisfy for our purposes is that, asymptotically,
the average loss $\hat\mu_n$ is not larger than the smallest asymptotic average loss $\hat\mu_{i,n}$. More
precisely, we need to construct a forecaster that achieves

\[
\limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,\ldots,N} \limsup_{n \to \infty} \hat\mu_{i,n}.
\]

Surprisingly, there exists a deterministic forecaster that satisfies this asymptotic inequality
regardless of the opponent’s behavior. Here we describe such a strategy for the case of
$N = 2$ actions. The simple extension to the general case of more than two actions is left as
an exercise (Exercise 7.30). Consider the following simple deterministic forecaster.
DETERMINISTIC EXPLORATION-EXPLOITATION
For each round $t = 1, 2, \ldots$


(1) (Exploration) if $t = k^2$ for an integer $k$, then set $I_t = 1$;
if $t = k^2 + 1$ for an integer $k$, then set $I_t = 2$;

(2) (Exploitation) otherwise, let $I_t = \arg\min_{i=1,2} \hat\mu_{i,t-1}$ (in case of a tie, break it,
say, in favor of action 1).


This simple forecaster is a version of fictitious play, based on the averaged losses, 
in which the exploration step simply guarantees that every action is sampled infinitely 
often. There is nothing special about the time instances of the form $t = k^2, k^2 + 1$; any
sparse infinite sequence would do the job. In fact, the original algorithm of de Farias and 
Megiddo [85] chooses the exploration steps randomly. 

Observe, in passing, that this is a “bandit’’-type predictor in the sense that it only needs 
to observe the losses of the played actions. 
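
A minimal sketch of the deterministic exploration-exploitation forecaster follows. The function names, the callback signature `loss_fn(t, action)`, and the toy environment with constant losses 0.6 and 0.3 are assumptions made for illustration; when both exploration rules apply (at $t = 1$), the sketch gives precedence to the first one.

```python
import math

def is_square(m):
    """True if m is a positive perfect square."""
    r = math.isqrt(m)
    return m >= 1 and r * r == m

def run_forecaster(loss_fn, n):
    """Play n rounds; loss_fn(t, action) returns the loss in [0, 1] of the action played at t.

    Only the loss of the played action is observed (bandit-type feedback).
    Returns the list of plays and the forecaster's average loss.
    """
    sums = {1: 0.0, 2: 0.0}    # cumulative loss of each action over the rounds it was played
    counts = {1: 0, 2: 0}      # number of rounds each action was played
    plays, total = [], 0.0
    for t in range(1, n + 1):
        if is_square(t):                 # exploration: t = k^2
            action = 1
        elif is_square(t - 1):           # exploration: t = k^2 + 1
            action = 2
        else:                            # exploitation: smallest averaged loss so far
            avg = {i: sums[i] / counts[i] if counts[i] else 1.0 for i in (1, 2)}
            action = 1 if avg[1] <= avg[2] else 2   # ties go to action 1
        loss = loss_fn(t, action)
        sums[action] += loss
        counts[action] += 1
        total += loss
        plays.append(action)
    return plays, total / n

def fixed_losses(t, action):
    """Toy environment (an assumption): action 1 always costs 0.6, action 2 always 0.3."""
    return 0.6 if action == 1 else 0.3

plays, average_loss = run_forecaster(fixed_losses, 100)
```

In this toy run action 1 is played only at the ten perfect squares up to 100, so the average loss is (10·0.6 + 90·0.3)/100 = 0.33, slightly above the best constant value 0.3 because of the forced exploration.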


Theorem 7.10. Regardless of the sequence of outcomes $J_1, J_2, \ldots$ the deterministic forecaster
defined above satisfies

\[
\limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,2} \limsup_{n \to \infty} \hat\mu_{i,n}.
\]


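Proof. For each $t = 1, 2, \ldots$, let $i_t^* = \arg\min_{i=1,2} \hat\mu_{i,t}$ and let $t_1, t_2, \ldots$ be the time
instances such that $i_t^* \ne i_{t-1}^*$, that is, the “leader” is switched. If there is only a finite
number of such $t_s$’s, then, obviously,

\[
\hat\mu_n - \min_{i=1,2} \hat\mu_{i,n} \to 0,
\]

which implies the stated inequality. Thus, we may assume that there is an infinite number
of switches and it suffices to show that whenever $T = \max\{t_s : t_s \le n\}$, then either

\[
\hat\mu_n - \min_{i=1,2} \hat\mu_{i,T} \le \frac{\text{const.}}{T^{1/4}} \tag{7.5}
\]

or

\[
\hat\mu_n - \min_{i=1,2} \hat\mu_{i,n} \le \frac{\text{const.}}{T^{1/4}}, \tag{7.6}
\]

which implies the statement.
First observe that, due to the exploration step, for any $t \ge 3$ and $i = 1, 2$,

\[
\sum_{s=1}^{t} \mathbb{I}_{\{I_s=i\}} \ge \bigl\lfloor \sqrt{t-1} \bigr\rfloor \ge \sqrt{t}/2.
\]

But then

\[
|\hat\mu_{1,T} - \hat\mu_{2,T}| \le \frac{2}{\sqrt{T}}.
\]

This inequality holds because by the boundedness of the loss, at time $T$, the averaged loss
of action $i$ can change by at most $1 / \sum_{s=1}^{T} \mathbb{I}_{\{I_s=i\}} \le 2/\sqrt{T}$, and the definition of the switch
is that the one that was larger in the previous step becomes smaller, which is only possible
if the averaged losses of the two actions were already $2/\sqrt{T}$ close to each other. But then
the averaged loss of the forecaster at the time $T$ (the last switch before time $n$) may be
bounded by

\[
\begin{aligned}
\hat\mu_T &= \frac{1}{T} \left( \sum_{s=1}^{T} \ell(I_s, J_s)\, \mathbb{I}_{\{I_s=1\}} + \sum_{s=1}^{T} \ell(I_s, J_s)\, \mathbb{I}_{\{I_s=2\}} \right) \\
 &= \frac{1}{T} \left( \hat\mu_{1,T} \sum_{s=1}^{T} \mathbb{I}_{\{I_s=1\}} + \hat\mu_{2,T} \sum_{s=1}^{T} \mathbb{I}_{\{I_s=2\}} \right) \\
 &\le \min_{i=1,2} \hat\mu_{i,T} + \frac{2}{\sqrt{T}}.
\end{aligned}
\]

Now assume that $T$ is so large that $n - T \le T^{3/4}$. Then clearly, $|\hat\mu_n - \hat\mu_T| \le T^{3/4}/n \le
T^{-1/4}$ and (7.5) holds.

Thus, in the rest of the proof we assume that $n - T \ge T^{3/4}$. It remains to show that $\hat\mu_n$
cannot be much larger than $\min_{i=1,2} \hat\mu_{i,n}$. Introduce the notation

\[
\delta = \hat\mu_n - \min_{i=1,2} \hat\mu_{i,T}.
\]

Since

\[
\begin{aligned}
\hat\mu_n &= \frac{1}{n} \left( \sum_{t=1}^{T} \ell(I_t, J_t) + \sum_{t=T+1}^{n} \ell(I_t, J_t) \right) \\
 &\le \frac{1}{n} \left( T \min_{i=1,2} \hat\mu_{i,T} + 2\sqrt{T} + \sum_{t=T+1}^{n} \ell(I_t, J_t) \right),
\end{aligned}
\]

we have

\[
\sum_{t=T+1}^{n} \ell(I_t, J_t) \ge (n - T) \min_{i=1,2} \hat\mu_{i,T} + \delta n - 2\sqrt{T}.
\]

Since, apart from at most $\sqrt{n - T}$ exploration steps, the same action is played between
times $T + 1$ and $n$, we have

\[
\begin{aligned}
\min_{i=1,2} \hat\mu_{i,n}
 &\ge \frac{\hat\mu_{i_T^*,T} \sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + \sum_{t=T+1}^{n} \ell(I_t, J_t) - \sqrt{n - T}}
 {\sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + (n - T)} \\
 &\ge \frac{\hat\mu_{i_T^*,T} \sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + (n - T) \min_{i=1,2} \hat\mu_{i,T} + \delta n - 2\sqrt{T} - \sqrt{n - T}}
 {\sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + (n - T)} \\
 &= \frac{\hat\mu_{i_T^*,T} \left( \sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + (n - T) \right) + \delta n - 2\sqrt{T} - \sqrt{n - T}}
 {\sum_{s=1}^{T} \mathbb{I}_{\{I_s=i_T^*\}} + (n - T)} \\
 &\ge \hat\mu_{i_T^*,T} + \delta - \frac{2\sqrt{T}}{n - T} - \frac{1}{\sqrt{n - T}} \\
 &= \hat\mu_n - \frac{2\sqrt{T}}{n - T} - \frac{1}{\sqrt{n - T}} \\
 &\ge \hat\mu_n - 3\, T^{-1/4},
\end{aligned}
\]

where at the last inequality we used $n - T \ge T^{3/4}$. Thus, (7.6) holds in this case. $\blacksquare$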
There is one more ingredient we need in order to establish strategies of the desired
behavior. As we have mentioned before, our aim is to design strategies that perform well
if the behavior of the opponent is such that the row player can estimate, for each action,
the average loss suffered by playing that action all the time. In order to do this, we modify
the forecaster studied above such that whenever an action is chosen, it is played repeatedly
sufficiently many times in a row so that the forecaster gets a good picture of the behavior
of the opponent when that action is played. This modification is done trivially by simply
repeating each action $\tau$ times, where the positive integer $\tau$ is a parameter of the strategy.


REPEATED DETERMINISTIC EXPLORATION-EXPLOITATION
Parameter: Number of repetitions $\tau$.
For each round $t = 1, 2, \ldots$
(1) (Exploration) if $t = k^2 \tau + s$ for integers $k$ and $s = 0, 1, \ldots, \tau - 1$, then set
$I_t = 1$;
if $t = (k^2 + 1)\tau + s$ for integers $k$ and $s = 0, 1, \ldots, \tau - 1$, then set $I_t = 2$;


(2) (Exploitation) otherwise, let $I_t = \arg\min_{i=1,2} \hat\mu_{i,\tau \lfloor t/\tau \rfloor - 1}$ (in case of a tie, break
it, say, in favor of action 1).


Theorem 7.10 (as well as Exercise 7.30) trivially extends to this case and the strategy
defined above obviously satisfies

\[
\limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,2} \limsup_{n \to \infty} \hat\mu_{i,n}
\]

regardless of the opponent’s actions and the parameter $\tau$.
Our main assumption on the opponent’s behavior is that, for every action $i$, there exists
a number $\bar\mu_i \in [0, 1]$ such that for any time instance $t$ and past plays $I_1, \ldots, I_t$,

\[
\frac{1}{\tau} \sum_{s=t+1}^{t+\tau} \ell(i, J_s) - \bar\mu_i \le \varepsilon_\tau,
\]

where $\varepsilon_\tau$ is a sequence of nonnegative numbers converging to 0 as $\tau \to \infty$. (Here the
average loss is computed by assuming that the row player’s moves are $I_1, \ldots, I_t, i, i, \ldots, i$.)
After de Farias and Megiddo [86], we call an opponent satisfying this condition flexible.
Clearly, if the opponent is flexible, then for any action $i$ the average loss of playing the
action forever is at most $\bar\mu_i$. Moreover, the performance bound for the repeated deterministic
exploration-exploitation immediately implies the following.


Corollary 7.3. Assume that the row player plays according to the repeated deterministic
exploration-exploitation strategy with parameter $\tau$ against a flexible opponent. Then the
asymptotic average cumulative loss of the row player satisfies

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le \min_{i=1,\ldots,N} \bar\mu_i + \varepsilon_\tau.
\]
The assumption of flexibility is satisfied in many cases when the opponent’s long-term
behavior against any fixed action can be estimated by playing the action repeatedly for a
stretch of time of length $\tau$. This is satisfied, for example, when the opponent is modeled
by a finite automaton. In the example of the opponent playing tit for tat in the prisoners’
dilemma described at the beginning of this section, the opponent is clearly flexible with
$\varepsilon_\tau = 1/\tau$. Note that in these cases one actually has

\[
\left| \frac{1}{\tau} \sum_{s=t+1}^{t+\tau} \ell(i, J_s) - \bar\mu_i \right| \le \varepsilon_\tau
\]

(with $\bar\mu_c = 1/3$ and $\bar\mu_d = 2/3$); that is, the estimated average losses are actually close to the
asymptotic performance of the corresponding action. However, for Corollary 7.3 it suffices
to require the one-sided inequality.
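
The repeated strategy can be tried against tit for tat in a short simulation. The following is a sketch under assumptions: action 1 stands for “c” and action 2 for “d”, the loss values (1/3, 1, 0, 2/3) are those assumed for the prisoners’ dilemma example, the opponent opens with “c”, and when both exploration rules apply to a block the first one takes precedence. The function names are hypothetical.

```python
import math

# Row player's losses keyed by (row action, column action); 1 = cooperate, 2 = defect (assumed).
LOSS = {(1, 1): 1 / 3, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 2 / 3}

def is_square(m):
    """True if the nonnegative integer m is a perfect square (0 counts as 0^2)."""
    r = math.isqrt(m)
    return r * r == m

def repeated_explore_exploit_vs_tit_for_tat(n, tau):
    """Average loss of the repeated exploration-exploitation strategy against tit for tat."""
    sums = {1: 0.0, 2: 0.0}    # cumulative loss of each action over the rounds it was played
    counts = {1: 0, 2: 0}
    opponent = 1               # tit for tat is assumed to open with cooperation
    block_action = 1
    total = 0.0
    for t in range(1, n + 1):
        if t == 1 or t % tau == 0:           # first round of the block with index t // tau
            block = t // tau
            if is_square(block):             # exploration block for action 1
                block_action = 1
            elif is_square(block - 1):       # exploration block for action 2
                block_action = 2
            else:                            # exploitation, using losses up to the previous block
                avg = {i: sums[i] / counts[i] if counts[i] else 1.0 for i in (1, 2)}
                block_action = 1 if avg[1] <= avg[2] else 2
        loss = LOSS[(block_action, opponent)]
        sums[block_action] += loss
        counts[block_action] += 1
        total += loss
        opponent = block_action              # tit for tat copies the previous row action
    return total / n

tau = 20
average_loss = repeated_explore_exploit_vs_tit_for_tat(20000, tau)
```

With $\tau = 20$ the averaged loss of cooperation stays near 1/3 and that of defection near 2/3, so every exploitation block cooperates and the overall average stays below $1/3 + 1/\tau$, in line with Corollary 7.3.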

Corollary 7.3 states the existence of a strategy of playing repeated games such that,
against any flexible opponent, the average loss is at most that of the best action (calculated
by assuming that the action is played constantly) plus the quantity $\varepsilon_\tau$ that can be made
arbitrarily small by choosing the parameter $\tau$ of the algorithm sufficiently large. However,
the sequence $\varepsilon_\tau$ depends on the opponent and may not be known to the forecaster. Thus,
it is desirable to find a forecaster whose average loss actually achieves $\min_{i=1,\ldots,N} \bar\mu_i$
asymptotically. Such a method may now easily be constructed.


Corollary 7.4. There exists a forecaster such that whenever the opponent is flexible,

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le \min_{i=1,\ldots,N} \bar\mu_i.
\]


We leave the details as a routine exercise (Exercise 7.32). 


Remark 7.17 (Randomized opponents). In some cases it may be meaningful to consider 
strategies for the adversary that use randomization. In such cases our definition of flexibility, 
which poses a deterministic condition on the opponent, is not realistic. However, the 
definition may be easily modified to accommodate a possibly randomized behavior. In fact, 
the original definition of de Farias and Megiddo [86] involves a probabilistic assumption. 


7.12 Bibliographic Remarks 


Playing and learning in repeated games is an important branch of game theory with an
extensive literature. In this chapter we addressed only a tiny corner of this immense subject. The
interested reader may consult the monographs of Fudenberg and Levine [119], Sorin [276], 
and Young [316]. Hart [144] gives an excellent survey of regret-based uncoupled learning 
dynamics. 

von Neumann’s minimax theorem is the classic result of game theory (see von Neumann 
and Morgenstern [296]), and most standard textbooks on game theory provide a proof. 
Various generalizations, including stronger versions of Theorem 7.1, are due to Fan [93] 
and Sion [271] (see also the references therein). The proof of Theorem 7.1 shown here is a
generalization of ideas of Freund and Schapire [114], who prove von Neumann’s minimax
theorem using the strategy described in Exercise 7.9.

The notion and the proof of existence of Nash equilibria appear in the celebrated
paper of Nash [222]. For the basic results on the Nash convergence of fictitious play, see
Robinson [246], Miyasawa [218], Shapley [265], and Monderer and Shapley [220]. Hofbauer
and Sandholm [163] consider stochastic fictitious play, similar in spirit to the
follow-the-perturbed-leader forecaster considered in Chapter 4, and prove its convergence for a
class of games. See the references within [163] for various related results. Singh, Kearns,
and Mansour [270] show that a simple dynamics based on gradient descent yields average
payoffs asymptotically equivalent to those of a Nash equilibrium in the special case of
two-player games in which both players have two actions.

The notion of correlated equilibrium was first introduced by Aumann [16,17]. A direct 
proof of the existence of correlated equilibria, using just von Neumann’s minimax theorem 
(as opposed to the fixed point theorem needed to prove the existence of Nash equilibria) 
was given by Hart and Schmeidler [150]. The existence of adaptive procedures leading to 
a correlated equilibrium was shown by Foster and Vohra [105]; see also Fudenberg and 
Levine [118, 121] and Hart and Mas-Colell [145, 146]. Stoltz and Lugosi [278] generalize 
this to games with an infinite, but compact, set of actions. The connection of calibration and 
correlated equilibria, described in Section 7.6, was pointed out by Foster and Vohra [105]. 
Kakade and Foster [171] take these ideas further and show that if all players play according 
to a best response to a certain common, “almost deterministic,” well-calibrated forecaster, 
then the joint empirical frequencies of play converge not only to the set of correlated 
equilibria but, in fact, to the convex hull of the set of Nash equilibria. Hart and
Mas-Colell [145] introduce a strategy, the so-called regret matching, conceptually much simpler
than the internal regret minimization procedures described in Section 4.4, which has the
property that if all players follow this strategy, the joint empirical frequencies converge
to the set of correlated equilibria; see also Cahn [44]. Kakade, Kearns, Langford, and
Ortiz [172] consider efficient algorithms for computing correlated equilibria in graphical
games. The result of Section 7.5 appears in Hart and Mas-Colell [147].

Blackwell’s approachability theory dates back to [28], where Theorem 7.5 is proved. It 
was also Blackwell [29] who pointed out that the approachability theorem may be used to 
construct Hannan-consistent forecasting strategies. Various generalizations of this theorem 
may be found in Vieille [295] and Lehrer [193]. Fabian and Hannan [92] studied rates of
convergence in an extended setting in which payoffs may be random and not necessarily
bounded. The potential-based strategies of Section 7.8 were introduced by Hart and
Mas-Colell [146] and Theorem 7.6 is due to them. In [146] the result is stated under a weaker
assumption than convexity of the potential function.

The problem of learning Nash equilibria by uncoupled strategies has been pursued by 
Foster and Young [108, 109]. They introduce the idea of regret testing, which the procedures 
studied in Section 7.10 are based on. Their procedures guarantee that, asymptotically, the 
mixed strategy profiles are within distance $\varepsilon$ of the set of Nash equilibria in a fraction of
at least $1 - \varepsilon$ of time. On the negative side, Hart and Mas-Colell [148, 149] show that it
is impossible to achieve convergence to Nash equilibrium for all games if one is restricted
to use stationary strategies that have bounded memory. By “bounded memory” they mean
that there is a finite integer $T$ such that each player bases his play only on the last $T$ rounds
of play. On the other hand, for every $\varepsilon > 0$ they show a randomized bounded-memory
stationary uncoupled procedure, different from those presented in Sections 7.9 and 7.10,
for which the joint empirical frequencies of play converge almost surely to an $\varepsilon$-Nash
equilibrium. Germano and Lugosi [126] modify the regret testing procedure of Foster and
Young to achieve almost sure convergence to the set of $\varepsilon$-Nash equilibria for all games.
The analysis of Section 7.10 is based on [126]. In particular, the proof of Lemma 7.5 is
found in [126], though the somewhat simpler case of two players is shown in Foster and
Young [109].

A closely related branch of the literature, not discussed in this chapter, studies
learning rules in which players update their beliefs using Bayes’ rule. Kalai and
Lehrer [176] show that if the priors “contain a grain of truth,” the play converges to a Nash
equilibrium of the game. See also Jordan [169, 170], Dekel, Fudenberg, and Levine [87],
Fudenberg and Levine [117, 119], and Nachbar [221].

Kalai, Lehrer, and Smorodinsky [177] show that this type of learning is closely related
to stronger notions of calibration and merging. See also Lehrer and Smorodinsky [196] and
Sandroni and Smorodinsky [258].

The material presented in Section 7.11 is based on the work of de Farias and Megiddo [85, 
86], though the analysis shown here is different. In particular, the forecaster of de Farias 
and Megiddo is randomized and conceptually simpler than the deterministic predictor used 
here. 


7.13 Exercises 


7.1 Show that the set of all Nash equilibria of a two-person zero-sum game is closed and convex. 


7.2 (Shapley’s game) Consider the two-person game described by the loss matrices of the two 
players (“R” and “C”), known as Shapley’s game: 


R\C | 1 2 3        R\C | 1 2 3
 1  | 0 1 1         1  | 1 0 1
 2  | 1 0 1         2  | 1 1 0
 3  | 1 1 0         3  | 0 1 1


Show that if both players use fictitious play, the empirical frequencies of play do not converge 
to the set of correlated equilibria (Foster and Vohra [105]). 


7.3 Prove Lemma 7.1. 


7.4 Consider the two-person game given by the losses 


R\C | 1 2        R\C | 1 2
 1  | 1 5         1  | 1 0
 2  | 0 7         2  | 5 7

Find all three Nash equilibria of the game. Show that the distribution given by P(1, 1) = 1/3, 
P(1,2) = 1/3, P(2, 1) = 1/3, P(2, 2) = 0 is a correlated equilibrium that lies outside of the 
convex hull of the Nash equilibria. (Aumann [17]). 


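7.5 Show that a probability distribution $P$ over $\bigotimes_{k=1}^{K} \{1, \ldots, N_k\}$ is a correlated equilibrium if and
only if for all $k = 1, \ldots, K$,

\[
\mathbb{E}\, \ell^{(k)}(\mathbf{I}) \le \mathbb{E}\, \ell^{(k)}(\mathbf{I}^-, \tilde I^{(k)}),
\]

where $\mathbf{I} = (I^{(1)}, \ldots, I^{(K)})$ is distributed according to $P$ and the random variable $\tilde I^{(k)}$ is any
function of $I^{(k)}$ and of a random variable $U$ independent of $\mathbf{I}$.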

7.6 Consider the repeated time-varying game described in Remark 7.3, with $N = M = 2$. Assume
that there exist positive numbers $\varepsilon, \delta$ such that, for every sufficiently large $n$, for at least $n\delta$ time
steps $t$ between time 1 and $n$,

\[
\max_{j=1,2} |\ell_t(1, j) - \ell_t(2, j)| > \varepsilon.
\]

Show that then for any sequence of mixed strategies $\mathbf{p}_1, \mathbf{p}_2, \ldots$ of the row player, the column
player can choose his mixed strategies $\mathbf{q}_1, \mathbf{q}_2, \ldots$ such that the row player’s cumulative loss
satisfies

\[
\sum_{t=1}^{n} \ell_t(I_t, J_t) - \sum_{t=1}^{n} \min_{i=1,2} \ell_t(i, J_t) > \gamma n
\]

for all sufficiently large $n$ with probability 1, where $\gamma$ is positive.


7.7 Assume that in a two-person zero-sum game, for all $t$, the row player plays according to the
constant mixed strategy $\mathbf{p}_t = \mathbf{p}$, where $\mathbf{p}$ is any mixed strategy for which there exists a mixed
strategy $\mathbf{q}$ of the column player such that $\bar\ell(\mathbf{p}, \mathbf{q}) = V$. Show that

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le V.
\]

Show also that, for any $\varepsilon > 0$, the row player, regardless of how he plays, cannot guarantee that

\[
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le V - \varepsilon.
\]

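7.8 Consider a two-person zero-sum game and assume that the row player plays according to the
exponentially weighted average mixed strategy

\[
p_{i,t} = \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(i, J_s)\bigr)}{\sum_{k=1}^{N} \exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(k, J_s)\bigr)}, \qquad i = 1, \ldots, N.
\]

Show that, with probability at least $1 - \delta$, the average loss of the row player satisfies

\[
\frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le V + \frac{\ln N}{n \eta} + \frac{\eta}{8} + \sqrt{\frac{2}{n} \ln \frac{2N}{\delta}}.
\]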


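7.9 Freund and Schapire [113] investigate the weighted average forecaster in the simplified version
of the setup of Section 7.3, in which the row player gets to see the distribution $\mathbf{q}_{t-1}$ chosen by
the column player before making the play at time $t$. Then the following version of the weighted
average strategy for the row player is feasible:

\[
p_{i,t} = \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \bar\ell(i, \mathbf{q}_s)\bigr)}{\sum_{k=1}^{N} \exp\bigl(-\eta \sum_{s=1}^{t-1} \bar\ell(k, \mathbf{q}_s)\bigr)}, \qquad i = 1, \ldots, N,
\]

with $p_{i,1}$ set to $1/N$, where $\eta > 0$ is an appropriately chosen constant. Show that this strategy
is an instance of the weighted average forecaster (see Section 4.2), which implies that

\[
\sum_{t=1}^{n} \bar\ell(\mathbf{p}_t, \mathbf{q}_t) \le \min_{\mathbf{p}} \sum_{t=1}^{n} \bar\ell(\mathbf{p}, \mathbf{q}_t) + \frac{\ln N}{\eta} + \frac{n \eta}{8},
\]

where

\[
\bar\ell(\mathbf{p}_t, \mathbf{q}_t) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,t}\, q_{j,t}\, \ell(i, j).
\]

Show that if $I_t$ denotes the actual randomized play of the row player, then with an appropriately
chosen $\eta = \eta_t$,

\[
\lim_{n \to \infty} \frac{1}{n} \left( \sum_{t=1}^{n} \bar\ell(I_t, \mathbf{q}_t) - \min_{\mathbf{p}} \sum_{t=1}^{n} \bar\ell(\mathbf{p}, \mathbf{q}_t) \right) = 0 \quad \text{almost surely}
\]

(Freund and Schapire [113]).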


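7.10 Improve the bound of the previous exercise to

\[
\sum_{t=1}^{n} \bar\ell(\mathbf{p}_t, \mathbf{q}_t) \le \min_{\mathbf{p}} \left( \sum_{t=1}^{n} \bar\ell(\mathbf{p}, \mathbf{q}_t) - \frac{H(\mathbf{p})}{\eta} \right) + \frac{\ln N}{\eta} + \frac{n \eta}{8},
\]

where $H(\mathbf{p}) = -\sum_{i=1}^{N} p_i \ln p_i$ denotes the entropy of the probability vector $\mathbf{p} =
(p_1, \ldots, p_N)$. Hint: Improve the crude bound $\frac{1}{\eta} \ln \bigl( \sum_i e^{\eta R_{i,n}} \bigr) \ge \max_i R_{i,n}$ to $\frac{1}{\eta} \ln \bigl( \sum_i e^{\eta R_{i,n}} \bigr) \ge
\max_{\mathbf{p}} \bigl( \mathbf{R}_n \cdot \mathbf{p} + H(\mathbf{p})/\eta \bigr)$.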

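7.11 Consider repeated play in a two-person zero-sum game in which both players play such that

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) = V \quad \text{almost surely}.
\]

Show that the product distribution $\hat{\mathbf{p}}_n \times \hat{\mathbf{q}}_n$, with

\[
\hat p_{i,n} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{I_t=i\}} \quad \text{and} \quad \hat q_{j,n} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}_{\{J_t=j\}},
\]

converges, almost surely, to the set of Nash equilibria. Hint: Check the proof of Theorem 7.2.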


7.12 Robinson [246] showed that if in repeated playing of a two-person zero-sum game both
players play according to fictitious play (i.e., choose the best pure strategy against the average
mixed strategy of their opponent), then the product of the marginal empirical frequencies
of play converges to a solution of the game. Show, however, that fictitious play does not
have the following robustness property similar to the exponentially weighted average
strategy deduced in Theorem 7.2: if a player uses fictitious play but his opponent does not,
then the player’s normalized cumulative loss may be significantly larger than the value of
the game.


(Fictitious conditional regret minimization) Consider a two-person game in which the loss
matrix of both players is given by

    1\2 | 1  2
    ----+------
     1  | 0  1
     2  | 1  0

Show that if both players play according to fictitious play (breaking ties randomly if necessary),
then Nash equilibrium is achieved in a strong sense.

Consider now the “conditional” (or “internal”) version of fictitious play in which both players
$k = 1, 2$ select

$$I_t^{(k)} = \mathop{\mathrm{argmin}}_{i_k \in \{1,2\}} \frac{1}{t-1} \sum_{s=1}^{t-1} \mathbb{I}_{\{I_s^{(k)} = I_{t-1}^{(k)}\}}\, \ell^{(k)}(I_s^{-}, i_k).$$

Show that if the play starts with, say, (1, 2), then both players will have maximal loss in every
round of the game.

Show that if in a repeated play of a K -person game all players play according to some Hannan 
consistent strategy, then the joint empirical frequencies of play converge to the Hannan set of 
the game. 
Consider the two-person zero-sum game given by the loss matrix

     0    0   -1
     0    0    1
     1   -1    0

Show that the joint distribution

    1/3  1/3   0
    1/3   0    0
     0    0    0

is a correlated equilibrium of the game. This example shows that even in zero-sum games the
set of correlated equilibria may be strictly larger than the set of Nash equilibria (Forges [101]).

Describe a game for which $\mathcal{H} \setminus \mathcal{C} \neq \emptyset$, that is, the Hannan set contains some distributions that
are not correlated equilibria.
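
The correlated-equilibrium claim for the $3 \times 3$ zero-sum example above can be checked mechanically. The following Python sketch tests the defining inequalities (no player can lower the expected loss by deviating from a recommended action) in exact rational arithmetic. The function names `is_correlated_equilibrium` and `is_product` are hypothetical, and the column player's loss is taken to be the negative of the row player's loss, as in a zero-sum game.

```python
from fractions import Fraction


def is_correlated_equilibrium(loss_row, loss_col, dist):
    """Check the correlated-equilibrium inequalities for a two-player game.

    loss_row[i][j], loss_col[i][j]: losses of the row and the column player.
    dist[i][j]: joint probability of the action pair (i, j).
    """
    n, m = len(dist), len(dist[0])
    # Row player: recommended action i, candidate deviation i2.
    for i in range(n):
        for i2 in range(n):
            gain = sum(dist[i][j] * (loss_row[i][j] - loss_row[i2][j]) for j in range(m))
            if gain > 0:      # deviating to i2 would strictly reduce the loss
                return False
    # Column player: recommended action j, candidate deviation j2.
    for j in range(m):
        for j2 in range(m):
            gain = sum(dist[i][j] * (loss_col[i][j] - loss_col[i][j2]) for i in range(n))
            if gain > 0:
                return False
    return True


def is_product(dist):
    """Return True if dist equals the product of its two marginals."""
    rows = [sum(r) for r in dist]
    cols = [sum(c) for c in zip(*dist)]
    return all(dist[i][j] == rows[i] * cols[j]
               for i in range(len(rows)) for j in range(len(cols)))


LOSS_ROW = [[0, 0, -1], [0, 0, 1], [1, -1, 0]]
LOSS_COL = [[-x for x in row] for row in LOSS_ROW]     # zero-sum game
THIRD = Fraction(1, 3)
DIST = [[THIRD, THIRD, 0], [THIRD, 0, 0], [0, 0, 0]]

IS_CE = is_correlated_equilibrium(LOSS_ROW, LOSS_COL, DIST)
IS_PRODUCT = is_product(DIST)
```

Since Nash equilibria correspond to product distributions, a distribution that passes the first check but fails the second is a correlated equilibrium that is not a Nash equilibrium.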

Show that if $P \in \mathcal{H}$ is a product measure, then $P \in \mathcal{N}$. In other words, the product measures
in the Hannan set are precisely the Nash equilibria.

Show that in a $K$-person game with $N_k = 2$ for all $k = 1, \ldots, K$ (i.e., each player has two
actions to choose from), $\mathcal{H} = \mathcal{C}$.

Extend the procedure and the proof of Theorem 7.4 to the general case of $K$-person games.

Construct a game with two-dimensional vector-valued losses and a (nonconvex) set $S \subset \mathbb{R}^2$
such that all halfspaces containing $S$ are approachable but $S$ is not.

Construct a game with vector-valued losses and a closed and convex polytope such that if the 
polytope is written as a finite intersection of closed halfspaces, where the hyperplanes defining 
the halfspaces correspond to the faces of the polytope, then all these closed halfspaces are 
approachable but the polytope is not. 

Use Theorem 7.5 to show that, in the setup of Section 7.4, each player has a strategy such that
the limsup of the conditional regrets is nonpositive regardless of the other players’ actions.

This exercise presents a strategy that achieves a significantly faster rate of convergence in
Blackwell’s approachability theorem than that obtained in the proof of the theorem in the
text. Let $S$ be a closed and convex set, and assume that all halfspaces $H$ containing $S$ are
approachable. Define $\bar A_0 = 0$ and $\bar A_t = \frac{1}{t} \sum_{s=1}^{t} \ell(p_s, J_s)$ for $t \ge 1$. Define the row player’s
mixed strategy $p_t$ at time $t = 1, 2, \ldots$ as arbitrary if $\bar A_{t-1} \in S$ and by

$$\max_{j} \; a_{t-1} \cdot \ell(p_t, j) \le c_{t-1}$$

otherwise, where

$$a_{t-1} = \frac{\bar A_{t-1} - \pi_S(\bar A_{t-1})}{\left\| \bar A_{t-1} - \pi_S(\bar A_{t-1}) \right\|} \quad \text{and} \quad c_{t-1} = a_{t-1} \cdot \pi_S(\bar A_{t-1}).$$

Prove that there exists a universal constant $C$ (independent of $n$ and $d$) such that, with probability
at least $1 - \delta$,

$$\left\| A_n - \pi_S(A_n) \right\| \le \frac{2}{\sqrt{n}} + C \sqrt{\frac{\ln(1/\delta)}{n}}.$$

Hint: Proceed as in the proof of Theorem 7.5 to show that $\| \bar A_n - \pi_S(\bar A_n) \| \le 2/\sqrt{n}$. To
obtain a dimension-free constant when bounding $\| A_n - \bar A_n \|$, you will need an extension of
the Hoeffding-Azuma inequality to vector-valued martingales; see, for example, Chen and
White [58].
Consider the potential-based strategy, based on the average loss $A_{t-1}$, described at the beginning
of Section 7.8. Show that, under the same conditions on $S$ and $\Phi$ as in Theorem 7.6, the average
loss satisfies $\lim_{n \to \infty} d(A_n, S) = 0$ with probability 1. Hint: Mimic the proof of Theorem 7.6.


(A stationary strategy to find pure Nash equilibria) Assume that a $K$-person game has a pure
action Nash equilibrium and consider the following strategy for player $k$: If $t = 1, 2$, choose
$I_t^{(k)}$ randomly. If $t > 2$, if all players have played the same action in the last two periods (i.e.,
$I_{t-1} = I_{t-2}$) and $I_{t-1}^{(k)}$ was a best response to $I_{t-1}^{-}$, then repeat the same play, that is, define
$I_t^{(k)} = I_{t-1}^{(k)}$. Otherwise, choose $I_t^{(k)}$ uniformly at random.

Prove that if all players play according to this strategy, then a pure action Nash equilibrium
is eventually achieved, almost surely. (Hart and Mas-Colell [149].)

(Generic two-player game with a pure Nash equilibrium) Consider a two-player game with a
pure action Nash equilibrium. Assume also that the game is generic in the sense that the best
reply is always unique. Suppose at time $t$ each player repeats the play of time $t - 1$ if it was a best
response and selects an action randomly otherwise. Prove that a pure action Nash equilibrium
is eventually achieved, almost surely [149]. Hint: The process $I_1, I_2, \ldots$ is a Markov chain
with state space $\{1, \ldots, N_1\} \times \{1, \ldots, N_2\}$. Show that given any state $i = (i_1, i_2)$, which is not
a Nash equilibrium, the two-step transition probability satisfies

$$\mathbb{P}\left[ I_t \text{ is a Nash equilibrium} \mid I_{t-2} = (i_1, i_2) \right] \ge c$$

for a constant $c > 0$.


(A nongeneric game) Consider a two-player game (played by “R” and “C”) whose loss matrices
are given by

    R\C | 1  2  3        R\C | 1  2  3
    ----+---------       ----+---------
     1  | 0  1  0         1  | 1  0  1
     2  | 1  0  0         2  | 0  1  1
     3  | 1  1  0         3  | 0  0  0

Suppose both players play according to the strategy described in Exercise 7.26. Show that there
is a positive probability that the unique pure Nash equilibrium is never achieved. (This example
appears in Hart and Mas-Colell [149].)


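Show that almost all games (with respect to the Lebesgue measure) are such that there exist
constants $c_1, c_2 > 0$ such that for all sufficiently small $\varepsilon > 0$, the set $\mathcal{N}_\varepsilon$ of approximate Nash
equilibria satisfies

$$D_\infty(\mathcal{N}, c_1 \varepsilon) \subset \mathcal{N}_\varepsilon \subset D_\infty(\mathcal{N}, c_2 \varepsilon),$$

where $D_\infty(\mathcal{N}, \varepsilon) = \{ \pi \in \Sigma : \| \pi - \pi' \|_\infty \le \varepsilon, \ \pi' \in \mathcal{N} \}$ is the $L_\infty$ neighborhood of the set of
Nash equilibria, of radius $\varepsilon$. (See, e.g., Germano and Lugosi [126].)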

Use the procedure of experimental regret testing as a building block to design an uncoupled 
strategy such that if all players follow the strategy, the mixed strategy profiles converge almost 
surely to a Nash equilibrium of the game for almost all games. Hint: Follow the ideas described 
in Remark 7.14 and the Borel—Cantelli lemma. (Germano and Lugosi [126].) 


Extend the forecasting strategy defined in Section 7.11 to the case of N > 2 actions such that, 
regardless of the sequence of outcomes, 
lim sup, < min limsup kin- 
noo i=l,- n>oo 
Hint: Place the N actions in the leaves of a rooted binary tree and use the original algorithm 


recursively in every internal node of the tree. The strategy assigned to the root is the desired 
forecasting strategy. 
7.31 Assume that both players follow the deterministic exploration-exploitation strategy while
playing the prisoners’ dilemma. Show that the players will end up cooperating. However, if their
play is not synchronized (e.g., if the column player starts following the strategy at time $t = 3$),
both players will defect most of the time.

7.32 Prove Corollary 7.4. Hint: Modify the repeated deterministic exploration-exploitation forecaster
properly either by letting the parameter t grow with time or by using an appropriate doubling
trick.
8.1 Simulatable Experts


In this chapter we take a closer look at the sequential prediction problem of Chapter 2 in the
special case when the outcome space is $\mathcal{Y} = \{0, 1\}$, the decision space is $\mathcal{D} = [0, 1]$, and
the loss function is the “absolute loss” $\ell(\hat p, y) = |\hat p - y|$. We have already encountered this
loss function in Chapter 3, where it was shown that, for general experts, the absolute loss is
in some sense the “hardest” among all bounded convex losses. We now turn our attention to
a different problem: the characterization of the minimax regret $V_n(\mathcal{F})$ for the absolute loss
and for a given class $\mathcal{F}$ of simulatable experts (recall the definition of simulatable experts
from Section 2.9).

In the entire chapter, an expert $f$ means a sequence $f_1, f_2, \ldots$ of functions
$f_t : \{0, 1\}^{t-1} \to [0, 1]$, mapping sequences of past outcomes $y^{t-1}$ into elements of the
decision space. We use $\mathcal{F}$ to denote a class of (simulatable) experts $f$. Recall from
Section 2.10 that a forecasting strategy $P$ based on a class of simulatable experts is a
sequence $\hat p_1, \hat p_2, \ldots$ of functions $\hat p_t : \mathcal{Y}^{t-1} \to \mathcal{D}$ (to simplify notation, we often write $\hat p_t$
instead of $\hat p_t(y^{t-1})$). Recall also that the minimax regret $V_n(\mathcal{F})$ is defined for the absolute
loss by

$$V_n(\mathcal{F}) = \inf_{P} \sup_{y^n \in \{0,1\}^n} \left( \hat L(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right),$$

where $\hat L(y^n) = \sum_{t=1}^{n} |\hat p_t(y^{t-1}) - y_t|$ is the cumulative absolute loss of the forecaster $P$
and $L_f(y^n) = \sum_{t=1}^{n} |f_t(y^{t-1}) - y_t|$ denotes the cumulative absolute loss of expert $f$. The
infimum is taken over all forecasters $P$. (In this chapter, and similarly in Chapter 9, we
find it convenient to make the dependence of the cumulative loss on the outcome sequence
explicit; this is why we write $\hat L(y^n)$ for $\hat L_n$.)

Clearly, if the cardinality of $\mathcal{F}$ is $|\mathcal{F}| = N$, then $V_n(\mathcal{F}) \le V_n^{(N)}$, where $V_n^{(N)}$ is the
minimax regret with $N$ general experts defined in Section 2.10. As seen in Section 2.2, for all
$n$ and $N$, $V_n^{(N)} \le \sqrt{(n/2) \ln N}$. Also, by the results of Section 3.7,
$\sup_{n,N} V_n^{(N)} / \sqrt{n \ln N} \ge 1/\sqrt{2}$.

On the other hand, the behavior of $V_n(\mathcal{F})$ is significantly more complex as it depends
on the structure of the class $\mathcal{F}$. To understand the phenomenon, just consider a class of
$N = 2$ experts that always predict the same except for $t = n$, when one of the experts
predicts 0 and the other one predicts 1. In this case, since the experts are simulatable,
the forecaster may simply predict as both experts if $t < n$ and set $\hat p_n = 1/2$. It is
easy to see that this forecaster is minimax optimal, and therefore $V_n(\mathcal{F}) = 1/2$, which is
significantly smaller than the worst-case bound $\sqrt{(n/2) \ln 2}$. In general, intuitively, $V_n(\mathcal{F})$
is small if the experts are “close” to each other in some sense and large if the experts are
“spread out.”

The primary goal of this chapter is to investigate what geometrical properties of $\mathcal{F}$
determine the size of $V_n(\mathcal{F})$. In Section 8.2 we describe a forecaster that is optimal in the
minimax sense, that is, it achieves a worst-case regret equal to $V_n(\mathcal{F})$. The minimax optimal
forecaster also suggests a way of calculating $V_n(\mathcal{F})$ for any given class of experts, and this
calculation becomes especially simple in the case of static experts (see Section 2.9 for
the definition of a static expert). In Section 8.3 the minimax regret $V_n(\mathcal{F})$ is characterized
for static experts in terms of a so-called Rademacher average. Section 8.4 describes
possibly the simplest nontrivial example that illustrates the use of this characterization. In
Section 8.5 we derive general upper and lower bounds for classes of static experts in terms
of the geometric structure of the class $\mathcal{F}$. Section 8.6 is devoted to general (not necessarily
static) classes of simulatable experts. This case is somewhat more difficult to handle as
there is no elegant characterization of the minimax regret. Nevertheless, using simple
structural properties, we are able to derive matching upper and lower bounds for some
interesting classes of experts, such as the class of linear forecasters or the class of Markov
forecasters.


8.2 Optimal Algorithm for Simulatable Experts 


The purpose of this section is to present, in the case of simulatable experts, a forecaster that
is optimal in the sense that it minimizes, among all forecasters, the worst-case regret

$$\sup_{y^n \in \{0,1\}^n} \left( \hat L(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right),$$

that is,

$$\sup_{y^n \in \{0,1\}^n} \left( \hat L(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) = V_n(\mathcal{F}),$$

where all losses (we recall it once more) are measured using the absolute loss.

Before describing the optimal forecaster, we note that, since the experts are simulatable,
the forecaster may calculate the loss of each expert for any particular outcome sequence.
In particular, for all $y^n \in \{0, 1\}^n$, the forecaster may compute $\inf_{f \in \mathcal{F}} L_f(y^n)$.

We determine the optimal forecaster “backwards,” starting with $\hat p_n$, the prediction
at time $n$. Assume that the first $n - 1$ outcomes $y^{n-1}$ have been revealed and we want
to determine, optimally, the prediction $\hat p_n = \hat p_n(y^{n-1})$. Since our goal is to minimize the
worst-case regret, we need to determine $\hat p_n$ to minimize

$$\max \left\{ \hat L(y^{n-1}) + \ell(\hat p_n, 0) - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0), \ \ \hat L(y^{n-1}) + \ell(\hat p_n, 1) - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\},$$

where $y^{n-1}0$ denotes the string of $n$ bits whose first $n - 1$ bits are $y^{n-1}$ and the last bit
is 0. Minimizing this quantity is equivalent to minimizing

$$\max \left\{ \hat p_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0), \ \ 1 - \hat p_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\}.$$


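Clearly, if we write $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$, then this is achieved by

$$\hat p_n = \begin{cases} 0 & \text{if } A_n(y^{n-1}0) > A_n(y^{n-1}1) + 1 \\ 1 & \text{if } A_n(y^{n-1}0) + 1 < A_n(y^{n-1}1) \\ \dfrac{A_n(y^{n-1}1) - A_n(y^{n-1}0) + 1}{2} & \text{otherwise.} \end{cases}$$

A crucial observation is that this expression of $\hat p_n$ does not depend on the previous
predictions $\hat p_1, \ldots, \hat p_{n-1}$. Define

$$A_{n-1}(y^{n-1}) = \min_{\hat p_n \in [0,1]} \max \left\{ \hat p_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0), \ \ 1 - \hat p_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\}.$$

This may be rewritten as

$$A_{n-1}(y^{n-1}) = \min_{\hat p_n \in [0,1]} \max \left\{ \hat p_n + A_n(y^{n-1}0), \ \ 1 - \hat p_n + A_n(y^{n-1}1) \right\}.$$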


So far we have calculated the optimal prediction at the last time instance $\hat p_n$. Next we
determine optimally $\hat p_{n-1}$, assuming that at time $n$ the optimal prediction is used. Determining
$\hat p_{n-1}$ is clearly equivalent to minimizing

$$\max \left\{ \hat L(y^{n-2}) + \ell(\hat p_{n-1}, 0) + A_{n-1}(y^{n-2}0), \ \ \hat L(y^{n-2}) + \ell(\hat p_{n-1}, 1) + A_{n-1}(y^{n-2}1) \right\}$$

or, equivalently, to minimizing

$$\max \left\{ \hat p_{n-1} + A_{n-1}(y^{n-2}0), \ \ 1 - \hat p_{n-1} + A_{n-1}(y^{n-2}1) \right\}.$$

The solution is, as before,

$$\hat p_{n-1} = \begin{cases} 0 & \text{if } A_{n-1}(y^{n-2}0) > A_{n-1}(y^{n-2}1) + 1 \\ 1 & \text{if } A_{n-1}(y^{n-2}0) + 1 < A_{n-1}(y^{n-2}1) \\ \dfrac{A_{n-1}(y^{n-2}1) - A_{n-1}(y^{n-2}0) + 1}{2} & \text{otherwise.} \end{cases}$$

The procedure may be continued in the same way until we determine $\hat p_1$.
Formally, given a class $\mathcal{F}$ of experts and a positive integer $n$, a forecaster whose
worst-case regret equals $V_n(\mathcal{F})$ is determined by the following recursion.
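MINIMAX OPTIMAL FORECASTER
FOR THE ABSOLUTE LOSS


Parameters: Class $\mathcal{F}$ of simulatable experts.


1. (Initialization) $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$.
2. (Recurrence) For $t = n, n-1, \ldots, 1$,

$$A_{t-1}(y^{t-1}) = \min_{p \in [0,1]} \max \left\{ p + A_t(y^{t-1}0), \ \ 1 - p + A_t(y^{t-1}1) \right\}$$

and

$$\hat p_t(y^{t-1}) = \begin{cases} 0 & \text{if } A_t(y^{t-1}0) > A_t(y^{t-1}1) + 1 \\ 1 & \text{if } A_t(y^{t-1}0) + 1 < A_t(y^{t-1}1) \\ \dfrac{A_t(y^{t-1}1) - A_t(y^{t-1}0) + 1}{2} & \text{otherwise.} \end{cases}$$

Note that the recurrence for $A_t$ may also be written as

$$A_{t-1}(y^{t-1}) = \begin{cases} A_t(y^{t-1}0) & \text{if } A_t(y^{t-1}0) > A_t(y^{t-1}1) + 1 \\ A_t(y^{t-1}1) & \text{if } A_t(y^{t-1}0) + 1 < A_t(y^{t-1}1) \\ \dfrac{A_t(y^{t-1}1) + A_t(y^{t-1}0) + 1}{2} & \text{otherwise.} \end{cases} \tag{8.1}$$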

The algorithm for calculating the optimal forecaster has an important by-product: the value
$A_0$ of the quantity $A_{t-1}(y^{t-1})$ at the last step ($t = 1$) of the recurrence clearly gives

$$A_0 = \max_{y^n \in \{0,1\}^n} \left( \sum_{t=1}^{n} \ell(\hat p_t, y_t) + A_n(y^n) \right) = V_n(\mathcal{F}).$$

Thus, the same algorithm also calculates the minimal worst-case regret. In the next section
we will see some useful consequences of this fact.
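
The backward recursion can be sketched in Python for small horizons. This is a minimal illustration rather than a reference implementation: experts are assumed to be given as callables that map the tuple of past outcomes to a prediction in $[0, 1]$, and the names `minimax_forecaster`, `constant`, and `agree_then_split` are hypothetical. The table of values $A_t$ has $2^t$ entries, so the sketch is practical only for short sequences.

```python
from itertools import product


def minimax_forecaster(experts, n):
    """Backward recursion for the minimax optimal forecaster (absolute loss).

    experts : list of callables; f(past) -> prediction in [0, 1], where
              `past` is the tuple (y_1, ..., y_{t-1}) of earlier outcomes
    n       : horizon
    Returns (A_0, predictions), where A_0 is the minimax regret and
    predictions[past] is the optimal prediction after observing `past`.
    """
    def expert_loss(f, y):
        return sum(abs(f(y[:t]) - y[t]) for t in range(len(y)))

    # Initialization: A_n(y^n) = -inf_f L_f(y^n).
    A = {y: -min(expert_loss(f, y) for f in experts)
         for y in product((0, 1), repeat=n)}
    predictions = {}
    # Recurrence for t = n, n-1, ..., 1, following (8.1).
    for t in range(n, 0, -1):
        A_prev = {}
        for y in product((0, 1), repeat=t - 1):
            a0, a1 = A[y + (0,)], A[y + (1,)]
            if a0 > a1 + 1:
                predictions[y], A_prev[y] = 0.0, a0
            elif a0 + 1 < a1:
                predictions[y], A_prev[y] = 1.0, a1
            else:
                predictions[y] = (a1 - a0 + 1) / 2
                A_prev[y] = (a0 + a1 + 1) / 2
        A = A_prev
    return A[()], predictions


def constant(c):
    """Static expert that always predicts c."""
    return lambda past: c


def agree_then_split(last_value, n):
    """Expert predicting 0.25 before time n and `last_value` at time n."""
    return lambda past: 0.25 if len(past) < n - 1 else last_value


V_CONST, P_CONST = minimax_forecaster([constant(0.0), constant(1.0)], 3)
V_SPLIT, P_SPLIT = minimax_forecaster(
    [agree_then_split(0.0, 3), agree_then_split(1.0, 3)], 3)
```

The second pair of experts mirrors the two-expert example of Section 8.1, in which the experts agree until the last round; the value 0.25 used before the last round is an arbitrary illustrative choice.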


8.3 Static Experts 


In this section we focus our attention on static experts. Recall that an expert $f$ is called
static if for all $t = 1, 2, \ldots$ and $y^{t-1} \in \{0, 1\}^{t-1}$, $f_t(y^{t-1}) = f_t \in [0, 1]$. In other words,
static experts’ predictions do not depend on the past outcomes: they are fixed in advance.
For example, the expert that always predicts 0 regardless of the past outcomes is static, but
the expert whose prediction is the average of all previously seen outcomes is not static.

The following simple technical result has some surprising consequences. The simple
inductive proof is left as an exercise.


Lemma 8.1. Let $\mathcal{F}$ be an arbitrary class of static experts. Then for all $t = 1, \ldots, n$ and
$y^{t-1} \in \{0, 1\}^{t-1}$,

$$\left| A_t(y^{t-1}1) - A_t(y^{t-1}0) \right| \le 1.$$


This lemma implies the following result. 
Theorem 8.1. If $\mathcal{F}$ is a class of static experts, then

$$V_n(\mathcal{F}) = \frac{n}{2} - \frac{1}{2^n} \sum_{y^n \in \{0,1\}^n} \inf_{f \in \mathcal{F}} L_f(y^n).$$

Proof. Lemma 8.1 and (8.1) imply that for all $t = 1, \ldots, n$,

$$A_{t-1}(y^{t-1}) = \frac{A_t(y^{t-1}1) + A_t(y^{t-1}0) + 1}{2}.$$

Applying this equation recursively, we obtain

$$A_0 = \frac{1}{2^n} \sum_{y^n \in \{0,1\}^n} A_n(y^n) + \frac{n}{2}.$$

Recalling that $V_n(\mathcal{F}) = A_0$ and $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$, we conclude the proof. ∎


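To understand better the behavior of the value $V_n(\mathcal{F})$, it is advantageous to reformulate
the obtained expression. Recall that, because experts are static, each expert $f$ is represented
by a vector $(f_1, \ldots, f_n) \in [0, 1]^n$, where $f_t$ is the prediction of expert $f$ at time $t$. Since
$\ell(f_t, y_t) = |f_t - y_t|$, $L_f(y^n) = \sum_{t=1}^{n} |f_t - y_t|$. Also, the average over all possible outcome
sequences appearing in the expression of $V_n(\mathcal{F})$ may be treated as an expected value. To
this end, introduce i.i.d. symmetric Bernoulli random variables $Y_1, \ldots, Y_n$ (i.e.,
$\mathbb{P}[Y_t = 0] = \mathbb{P}[Y_t = 1] = 1/2$). Then, by Theorem 8.1,

$$\begin{aligned} V_n(\mathcal{F}) &= \frac{n}{2} - \mathbb{E}\left[ \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} |f_t - Y_t| \right] \\ &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \frac{1}{2} - |f_t - Y_t| \right) \right] \\ &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \frac{1}{2} - f_t \right) (1 - 2Y_t) \right] \\ &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \frac{1}{2} - f_t \right) \sigma_t \right], \end{aligned} \tag{8.2}$$

where $\sigma_t = 1 - 2Y_t$ are i.i.d. Rademacher random variables (i.e., with
$\mathbb{P}[\sigma_t = 1] = \mathbb{P}[\sigma_t = -1] = 1/2$). Thus, $V_n(\mathcal{F})$ equals $n$ times the Rademacher average

$$R_n(A) = \mathbb{E}\left[ \sup_{a \in A} \frac{1}{n} \sum_{i=1}^{n} \sigma_i a_i \right]$$

associated with the set $A$ of vectors of the form

$$a = (a_1, \ldots, a_n) = (1/2 - f_1, \ldots, 1/2 - f_n), \qquad f \in \mathcal{F}.$$

Rademacher averages are thoroughly studied objects in probability theory, and this will help
us establish tight upper and lower bounds on $V_n(\mathcal{F})$ for various classes of static experts in
Section 8.5. Some basic structural properties of Rademacher averages are summarized in
Section A.1.8.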
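
For a finite class of static experts and a short horizon, both sides of this reformulation can be evaluated by brute-force enumeration. The Python sketch below is an illustration under stated assumptions: each static expert is represented as a tuple $(f_1, \ldots, f_n)$, and the function names `regret_by_theorem_8_1` and `regret_by_rademacher` are hypothetical.

```python
from itertools import product


def regret_by_theorem_8_1(experts, n):
    """n/2 minus the average, over all y^n, of the best expert's loss."""
    total = 0.0
    for y in product((0, 1), repeat=n):
        total += min(sum(abs(f[t] - y[t]) for t in range(n)) for f in experts)
    return n / 2 - total / 2 ** n


def regret_by_rademacher(experts, n):
    """Expected supremum of sum_t (1/2 - f_t) * sigma_t, as in (8.2)."""
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum((0.5 - f[t]) * sigma[t] for t in range(n)) for f in experts)
    return total / 2 ** n


TWO_CONSTANTS = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
OTHER_CLASS = [(0.2, 0.9, 0.5), (0.7, 0.1, 0.4), (0.5, 0.5, 0.5)]

V_A = regret_by_theorem_8_1(TWO_CONSTANTS, 3)
V_B = regret_by_rademacher(TWO_CONSTANTS, 3)
GAP = abs(regret_by_theorem_8_1(OTHER_CLASS, 3) - regret_by_rademacher(OTHER_CLASS, 3))
```

The two functions agree because $1/2 - |f_t - y_t| = (1/2 - f_t)(1 - 2y_t)$ holds for every $y_t \in \{0, 1\}$ and $f_t \in [0, 1]$.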
8.4 A Simple Example


We consider a simple example to illustrate the usage of the formula (8.2). In Section 8.5
we obtain general upper and lower bounds for $V_n(\mathcal{F})$ based on the same characterization
in terms of Rademacher averages.

Consider the case when $\mathcal{F}$ is the class of all “constant” experts, that is, the class of
all static experts, parameterized by $q \in [0, 1]$, of the form $f^q = (q, \ldots, q)$. Thus each
expert predicts the same number throughout the $n$ rounds. The first thing we notice is that
the class $\mathcal{F}$ is the convex hull of the two “extreme” static experts $f^{(1)} \equiv 0$ and $f^{(2)} \equiv 1$.
Since the Rademacher average of the convex hull of a set equals that of the set itself (see
Section A.1.8 for the basic properties of Rademacher averages), the identity (8.2) implies
that $V_n(\mathcal{F}) = V_n(\mathcal{F}_0)$, where $\mathcal{F}_0$ contains the two extreme experts. (One may also easily
see that for any sequence $y^n$ of outcomes the expert minimizing the cumulative loss $L_f(y^n)$
over the class $\mathcal{F}$ is one of the two extreme experts.) Thus, it suffices to find bounds for
$V_n(\mathcal{F}_0)$. To this end recall that, by Theorem 2.2,

$$V_n(\mathcal{F}_0) \le V_n^{(2)} \le \sqrt{\frac{n \ln 2}{2}} \approx 0.5887 \sqrt{n}.$$

Next we contrast this bound with bounds obtained directly using (8.2). Since $f_t^{(1)} = 0$ and
$f_t^{(2)} = 1$ for all $t$,

$$V_n(\mathcal{F}_0) = \mathbb{E}\left[ \max\left\{ \sum_{t=1}^{n} \frac{1}{2} \sigma_t, \ \sum_{t=1}^{n} \left(-\frac{1}{2}\right) \sigma_t \right\} \right] = \frac{1}{2}\, \mathbb{E}\left| \sum_{t=1}^{n} \sigma_t \right|.$$


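Using the Cauchy-Schwarz inequality, we may easily bound this quantity from above as
follows:

$$V_n(\mathcal{F}_0) = \frac{1}{2}\, \mathbb{E}\left| \sum_{t=1}^{n} \sigma_t \right| \le \frac{1}{2} \sqrt{\mathbb{E}\left( \sum_{t=1}^{n} \sigma_t \right)^2} = \frac{\sqrt{n}}{2}.$$

Observe that this bound has the same order of magnitude as the bound obtained by
Theorem 2.2, but it has a slightly better constant.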

We may obtain a similar lower bound as an easy consequence of Khinchine’s inequality,
which we recall here (see the Appendix for a proof).


Lemma 8.2 (Khinchine’s inequality). Let $a_1, \ldots, a_n$ be real numbers, and let $\sigma_1, \ldots, \sigma_n$
be i.i.d. Rademacher random variables. Then

$$\mathbb{E}\left| \sum_{i=1}^{n} a_i \sigma_i \right| \ge \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} a_i^2}.$$

Applying Lemma 8.2 to our problem, we obtain the lower bound

$$V_n(\mathcal{F}_0) \ge \sqrt{\frac{n}{8}}.$$

Summarizing the upper and lower bounds, for every $n$ we have

$$0.3535 \le \frac{V_n(\mathcal{F})}{\sqrt{n}} \le 0.5.$$

For example, for $n = 100$ there exists a prediction strategy such that for any sequence
$y_1, \ldots, y_{100}$ the total loss is not more than that of the best expert plus 5, but for any
prediction strategy there exists a sequence $y_1, \ldots, y_{100}$ such that the regret is at least 3.5.
The exact asymptotic value is also easy to calculate: $V_n(\mathcal{F}_0)/\sqrt{n} \to 1/\sqrt{2\pi} \approx 0.3989$ (see
the exercises).
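
For this class the minimax regret can also be computed exactly, since $\sum_t \sigma_t = n - 2K$, where $K$ (the number of indices with $\sigma_t = -1$) has a binomial distribution with parameters $n$ and $1/2$. The short Python sketch below evaluates $\frac{1}{2}\mathbb{E}|\sum_t \sigma_t|$ this way and compares it with the bounds above; the function name `regret_two_constant_experts` is a hypothetical choice.

```python
from math import comb, pi, sqrt


def regret_two_constant_experts(n):
    """Exact value of (1/2) * E|sigma_1 + ... + sigma_n|.

    The sum of n Rademacher variables equals n - 2k with probability
    comb(n, k) / 2**n, where k counts the variables equal to -1.
    """
    return sum(comb(n, k) * abs(n - 2 * k) for k in range(n + 1)) / 2 ** (n + 1)


V_100 = regret_two_constant_experts(100)
RATIO_1000 = regret_two_constant_experts(1000) / sqrt(1000)
LIMIT = 1 / sqrt(2 * pi)
```

For $n = 100$ the exact value lies between the lower bound $\sqrt{100/8}$ and the upper bound $5$, and for larger $n$ the ratio to $\sqrt{n}$ approaches $1/\sqrt{2\pi}$.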


8.5 Bounds for Classes of Static Experts 


In this section we use Theorem 8.1 and some results from the rich theory of empirical
processes to obtain upper and lower bounds for $V_n(\mathcal{F})$ for general classes $\mathcal{F}$ of static
experts.

Theorem 8.1 characterizes the minimax regret as $n$ times the Rademacher average $R_n(A)$ of
the set $A = \mathbf{1/2} - \mathcal{F}$, where $\mathbf{1/2} = (1/2, \ldots, 1/2)$, and the class $\mathcal{F}$ of static experts is
now regarded as a subset of $\mathbb{R}^n$ (by associating, with each static expert $f$, the vector
$(f_1, \ldots, f_n) \in [0, 1]^n$). There are various ways of bounding Rademacher averages. One is
by using the structural properties summarized in Section A.1.8, another is in terms of the
geometrical structure of the set $A$. To illustrate the first method, we consider the following
example.


Example 8.1. Consider the class $\mathcal{F}$ of static experts $f = (f_1, \ldots, f_n)$ such that
$f_t = (1 + \sigma(b_t))/2$ for any vector $b = (b_1, \ldots, b_n)$ satisfying $\|b\|^2 = \sum_{t=1}^{n} b_t^2 \le \lambda^2$ for a constant
$\lambda > 0$, where

$$\sigma(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

is the standard “sigmoid” function. In other words, $\mathcal{F}$ contains all experts obtained by
“squashing” the elements of the ball of radius $\lambda$ into the cube $[0, 1]^n$. Intuitively, the
larger the $\lambda$, the more complex $\mathcal{F}$ is, which should be reflected in the value of $V_n(\mathcal{F})$. Next we
derive an upper bound that reflects this behavior. By Theorem 8.1, $V_n(\mathcal{F}) = n R_n(A)$, where
$A$ is the set of vectors of the form $a = (a_1, \ldots, a_n)$, with $a_i = \sigma(b_i)/2$ with $\sum_{i=1}^{n} b_i^2 \le \lambda^2$.
By the contraction principle (see Section A.1.8),

$$R_n(A) \le \frac{1}{2n}\, \mathbb{E}\left[ \sup_{b : \|b\| \le \lambda} \sum_{i=1}^{n} \sigma_i b_i \right] = \frac{\lambda}{2n}\, \mathbb{E}\left[ \sup_{b : \|b\| \le 1} \sum_{i=1}^{n} \sigma_i b_i \right],$$

where we used the fact that $\sigma$ is Lipschitz with constant 1. Now by the Cauchy-Schwarz
inequality, we have

$$\mathbb{E}\left[ \sup_{b : \|b\| \le 1} \sum_{i=1}^{n} \sigma_i b_i \right] = \mathbb{E} \sqrt{\sum_{i=1}^{n} \sigma_i^2} = \sqrt{n}.$$

We have thus shown that $V_n(\mathcal{F}) \le \lambda \sqrt{n}/2$, and this bound may be shown to be essentially
tight (see Exercise 8.11).


A way to capture the geometric structure of the class of experts $\mathcal{F}$ is to consider its
covering numbers. The covering numbers suitable for our analysis are defined as follows.
For any class $\mathcal{F}$ of static experts let $N_2(\mathcal{F}, r)$ be the minimum cardinality of a set $\mathcal{F}_r$ of
static experts (possibly not all belonging to $\mathcal{F}$) such that for all $f \in \mathcal{F}$ there exists a $g \in \mathcal{F}_r$
such that

$$\sqrt{\sum_{t=1}^{n} (f_t - g_t)^2} \le r.$$

The following bound shows how $V_n(\mathcal{F})$ may be bounded from above in terms of these
covering numbers.


Theorem 8.2. For any class $\mathcal{F}$ of static experts,

$$V_n(\mathcal{F}) \le 12 \int_{0}^{\sqrt{n}/2} \sqrt{\ln N_2(\mathcal{F}, r)}\; dr.$$


The result is a straightforward corollary of Theorem 8.1, Hoeffding’s inequality
(Lemma 2.2), and the following classical result of empirical process theory. To state the
result in a general form, consider a family $\{T_f : f \in \mathcal{F}\}$ of zero mean random variables
indexed by a metric space $(\mathcal{F}, \rho)$. Let $N_\rho(\mathcal{F}, r)$ denote the covering number of the metric
space $\mathcal{F}$ with respect to the metric $\rho$. The family is called subgaussian in the metric $\rho$
whenever

$$\mathbb{E}\left[ e^{\lambda (T_f - T_g)} \right] \le e^{\lambda^2 \rho(f, g)^2 / 2}$$

holds for any $f, g \in \mathcal{F}$ and $\lambda > 0$. The family is called sample continuous if for any
sequence $f^{(1)}, f^{(2)}, \ldots \in \mathcal{F}$ converging to some $f \in \mathcal{F}$, we have $T_{f^{(n)}} - T_f \to 0$ almost
surely. The proof of the following result is given in the Appendix.


Theorem 8.3. If $\{T_f : f \in \mathcal{F}\}$ is subgaussian and sample continuous in the metric $\rho$, then

$$\mathbb{E}\left[ \sup_{f \in \mathcal{F}} T_f \right] \le 12 \int_{0}^{D/2} \sqrt{\ln N_\rho(\mathcal{F}, \varepsilon)}\; d\varepsilon,$$

where $D$ is the diameter of $\mathcal{F}$.


For completeness, and without proof, we mention a lower bound corresponding to
Theorem 8.2. Once again, the inequality is a straightforward corollary of Theorem 8.1 and
known lower bounds for the expected maximum of Rademacher averages.


Theorem 8.4. Let $\mathcal{F}$ be an arbitrary class of static experts containing $f$ and $g$ such that
$f_t = 0$ and $g_t = 1$ for all $t = 1, \ldots, n$. Then there exists a universal constant $K > 0$ such
that

$$V_n(\mathcal{F}) \ge K \sup_{r > 0} \, r \sqrt{\ln N_2(\mathcal{F}, r)}.$$

The bound of Theorem 8.4 is often of the same order of magnitude as that of Theorem 8.2.
For examples we refer to the exercises.

The minimax regret $V_n(\mathcal{F})$ for static experts is expressed as the expected value of the
supremum of a Rademacher process. Such expected values have been studied and well
understood in empirical process theory. In fact, Theorems 8.3 and 8.4 are simple versions
of classical general results (known as “Dudley’s metric entropy bound” and “Sudakov’s
minoration”) of empirical process theory. There exist more modern tools for establishing
sharp bounds for expectations of maxima of random processes, such as “majorizing
measures” and “generic chaining.” In the bibliographic comments we point the interested reader
to some of the references.


8.6 Bounds for General Classes 


All bounds of the previous sections are based on Theorem 8.1, a characterization of $V_n(\mathcal{F})$
in terms of expected suprema of Rademacher processes. Unfortunately, no such tool is
available in the general case when the experts in the class $\mathcal{F}$ are not static. This section
discusses some techniques that may come to the rescue.

We begin with a simple but very useful bound for expert classes that are subsets of
“convex hulls” of just finitely many experts.


Theorem 8.5. Assume that the class of experts $\mathcal{F}$ satisfies the following: there exist $N$
experts $f^{(1)}, \ldots, f^{(N)}$ (not necessarily in $\mathcal{F}$) such that for all $f \in \mathcal{F}$ there exist convex
coefficients $q_1, \ldots, q_N \ge 0$, with $\sum_{j=1}^{N} q_j = 1$, such that $f_t(y^{t-1}) = \sum_{j=1}^{N} q_j f_t^{(j)}(y^{t-1})$
for all $t = 1, \ldots, n$ and $y^{t-1} \in \{0, 1\}^{t-1}$. Then

$$V_n(\mathcal{F}) \le \sqrt{(n/2) \ln N}.$$


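Proof. The key property is that for any bit sequence $y^n \in \{0, 1\}^n$ and expert
$f = \sum_{j=1}^{N} q_j f^{(j)} \in \mathcal{F}$ there exists an expert among $f^{(1)}, \ldots, f^{(N)}$ whose loss on $y^n$ is not
larger than that of $f$. To see this, note that

$$\begin{aligned} L_f(y^n) &= \sum_{t=1}^{n} \left| f_t(y^{t-1}) - y_t \right| \\ &= \sum_{t=1}^{n} \left| \sum_{j=1}^{N} q_j f_t^{(j)}(y^{t-1}) - y_t \right| \\ &= \sum_{t=1}^{n} \sum_{j=1}^{N} q_j \left| f_t^{(j)}(y^{t-1}) - y_t \right| \\ &= \sum_{j=1}^{N} q_j \sum_{t=1}^{n} \left| f_t^{(j)}(y^{t-1}) - y_t \right| \\ &= \sum_{j=1}^{N} q_j L_{f^{(j)}}(y^n) \\ &\ge \min_{j=1,\ldots,N} L_{f^{(j)}}(y^n). \end{aligned}$$

The third equality holds because $y_t \in \{0, 1\}$ and every prediction lies in $[0, 1]$, so all the
differences $f_t^{(j)}(y^{t-1}) - y_t$ have the same sign. Thus, for all $y^n \in \{0, 1\}^n$,
$\inf_{f \in \mathcal{F}} L_f(y^n) \ge \min_{j=1,\ldots,N} L_{f^{(j)}}(y^n)$. This implies that if $\hat p$ is
the exponentially weighted average forecaster based on the finite class $f^{(1)}, \ldots, f^{(N)}$, then,
by Theorem 2.2,

$$\hat L(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \le \hat L(y^n) - \min_{j=1,\ldots,N} L_{f^{(j)}}(y^n) \le \sqrt{\frac{n \ln N}{2}},$$

which completes the proof. ∎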
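
The key identity of this proof, $L_f(y^n) = \sum_j q_j L_{f^{(j)}}(y^n)$ for a convex combination $f$, is easy to check numerically along any fixed bit sequence. In the sketch below the base experts are represented only through the predictions they make along that sequence; the random data and the names `cumulative_loss`, `base`, and `mixture` are illustrative assumptions.

```python
import random


def cumulative_loss(predictions, outcomes):
    """Cumulative absolute loss of a sequence of predictions in [0, 1]."""
    return sum(abs(p - y) for p, y in zip(predictions, outcomes))


rng = random.Random(1)
n, N = 12, 4
outcomes = [rng.randint(0, 1) for _ in range(n)]

# base[j][t] plays the role of f^{(j)}_{t+1}(y^t) along the fixed sequence.
base = [[rng.random() for _ in range(n)] for _ in range(N)]

# Convex coefficients q_1, ..., q_N.
raw = [rng.random() for _ in range(N)]
q = [r / sum(raw) for r in raw]

# Predictions of the convex combination f = sum_j q_j f^{(j)}.
mixture = [sum(q[j] * base[j][t] for j in range(N)) for t in range(n)]

loss_mixture = cumulative_loss(mixture, outcomes)
base_losses = [cumulative_loss(b, outcomes) for b in base]
weighted_loss = sum(q_j * l_j for q_j, l_j in zip(q, base_losses))
```

Up to floating-point error, `loss_mixture` coincides with `weighted_loss`, and therefore it cannot be smaller than `min(base_losses)`.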
We now review two basic examples. 


Example 8.2 (Linear experts). As a first example, consider the class $\mathcal{L}_k$ of $k$th-order
autoregressive linear experts, where $k \ge 2$ is a fixed positive integer. Because each
prediction of an expert $f \in \mathcal{L}_k$ is determined by the last $k$ bits observed, we add an arbitrary
prefix $y_{-k+1}, \ldots, y_0$ to the sequence $y^n$ to be predicted. We use $y_{-k+1}^{n}$ to denote the resulting
sequence of $n + k$ bits. The class $\mathcal{L}_k$ contains all experts $f$ such that

$$f_t(y_{-k+1}^{t-1}) = \sum_{i=1}^{k} q_i\, y_{t-i}$$

for some $q_1, \ldots, q_k \ge 0$, with $\sum_{i=1}^{k} q_i = 1$. In other words, an expert in $\mathcal{L}_k$ predicts
according to a convex combination of the $k$ most recent outcomes in the sequence. Convexity of
the coefficients $q_i$ assures that $f_t(y_{-k+1}^{t-1}) \in [0, 1]$. Accordingly, for such experts the value
$V_n(\mathcal{F})$ is redefined by

$$V_n(\mathcal{F}) = \inf_{P} \max_{y_{-k+1}^{n} \in \{0,1\}^{n+k}} \left( \hat L(y_{-k+1}^{n}) - \inf_{f \in \mathcal{F}} L_f(y_{-k+1}^{n}) \right),$$

where

$$\hat L(y_{-k+1}^{n}) = \sum_{t=1}^{n} \left| \hat p_t(y_{-k+1}^{t-1}) - y_t \right|$$

and $L_f(y_{-k+1}^{n})$ is defined similarly.


Corollary 8.1. For all positive integers $n$ and $k \ge 2$,

$$V_n(\mathcal{L}_k) \le \sqrt{\frac{n \ln k}{2}}.$$

Proof. The statement is a direct consequence of Theorem 8.5 if we observe that $\mathcal{L}_k$ is the
convex hull of the $k$ experts $f^{(1)}, \ldots, f^{(k)}$ defined by $f_t^{(i)}(y_{-k+1}^{t-1}) = y_{t-i}$, $i = 1, \ldots, k$. ∎


Example 8.3 (Markov forecasters). For an arbitrary $k \ge 1$, we consider the class $\mathcal{M}_k$
of $k$th order Markov experts defined as follows. The class $\mathcal{M}_k$ is indexed by the set
$[0, 1]^{2^k}$ so that the index of any $f \in \mathcal{M}_k$ is the vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_{2^k - 1})$, with
$\alpha_s \in [0, 1]$ for $0 \le s < 2^k$. If $f$ has index $\alpha$, then $f_t(y_{-k+1}^{t-1}) = \alpha_s$ for all $1 \le t \le n$ and for all
$y_{-k+1}^{t-1} \in \{0, 1\}^{t+k-1}$, where $s$ has binary expansion $y_{t-k}, \ldots, y_{t-1}$. Because each prediction
of a $k$th-order Markov expert is determined by the last $k$ bits observed, we add a prefix
$y_{-k+1}, \ldots, y_0$ to the sequence to predict in the same way we did in the previous example
for the autoregressive experts. Thus, the function $f_t$ is now defined over the set $\{0, 1\}^{t+k-1}$.
Once more, Theorem 8.5 immediately implies a bound for $V_n(\mathcal{M}_k)$.


Corollary 8.2. For any positive integers $n$ and $k \ge 2$,

$$V_n(\mathcal{M}_k) \le \sqrt{\frac{2^k\, n \ln 2}{2}}.$$

Next we study how one can derive lower bounds for $V_n(\mathcal{F})$ for general classes of experts.
Even though Theorem 8.1 cannot be generalized to arbitrary classes of experts, its analog
remains true as an inequality.


Theorem 8.6. For any class of experts $\mathcal{F}$,

$$V_n(\mathcal{F}) \ge \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \frac{1}{2} - f_t(Y^{t-1}) \right) (1 - 2Y_t) \right],$$

where $Y_1, \ldots, Y_n$ are independent Bernoulli (1/2) random variables.


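Proof. For any prediction strategy, if $Y_t$ is a Bernoulli (1/2) random variable, then
$\mathbb{E}\,|\hat p_t(y^{t-1}) - Y_t| = 1/2$ for each $y^{t-1}$. Hence,

$$\begin{aligned} V_n(\mathcal{F}) &= \inf_{P} \max_{y^n \in \{0,1\}^n} \left( \hat L(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) \\ &\ge \inf_{P}\, \mathbb{E}\left[ \hat L(Y^n) - \inf_{f \in \mathcal{F}} L_f(Y^n) \right] \\ &= \frac{n}{2} - \mathbb{E}\left[ \inf_{f \in \mathcal{F}} L_f(Y^n) \right] \\ &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \frac{1}{2} - f_t(Y^{t-1}) \right) (1 - 2Y_t) \right]. \qquad ∎ \end{aligned}$$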


We demonstrate how to use this inequality for the example of the class of linear forecasters
described earlier. In fact, the following result shows that the bound of Corollary 8.1 is
asymptotically optimal.


Corollary 8.3.

$$\liminf_{k \to \infty} \, \liminf_{n \to \infty} \frac{V_n(\mathcal{L}_k)}{\sqrt{n \ln k}} = \frac{1}{\sqrt{2}}.$$


Proof. We only sketch the proof; the details are left to the reader as an easy exercise. By
Theorem 8.6, if we write $Z_t = 1 - 2Y_t$ for $t = -k + 1, \ldots, n$, then

$$V_n(\mathcal{L}_k) \ge \frac{1}{2}\, \mathbb{E}\left[ \sup_{f \in \mathcal{L}_k} \sum_{t=1}^{n} \sum_{i=1}^{k} q_i Z_t Z_{t-i} \right] \ge \frac{1}{2}\, \mathbb{E}\left[ \max_{i=1,\ldots,k} \sum_{t=1}^{n} Z_t Z_{t-i} \right].$$

The proof is now similar to the proof of Theorem 3.7 with the exception that instead of the
ordinary central limit theorem, we use a generalization of it to martingales.
Consider the $k$-vector $X_n = (X_{n,1}, \ldots, X_{n,k})$ of components

$$X_{n,i} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} Z_t Z_{t-i}, \qquad i = 1, \ldots, k.$$

By the Cramér-Wold device (see Lemma 13.11 in the Appendix) the sequence of vectors
$\{X_n\}$ converges in distribution to a vector random variable $N = (N_1, \ldots, N_k)$ if and
only if $\sum_{i=1}^{k} a_i X_{n,i}$ converges in distribution to $\sum_{i=1}^{k} a_i N_i$ for all possible choices of the
coefficients $a_1, \ldots, a_k$. Thus consider

$$\sum_{i=1}^{k} a_i X_{n,i} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} Z_t \sum_{i=1}^{k} a_i Z_{t-i}.$$

It is easy to see that the sequence of random variables $\sqrt{n}\, X_{n,i}$, $n = 1, 2, \ldots$, forms a
martingale with respect to the sequence of $\sigma$-algebras $\mathcal{G}_t$ generated by $Z_{-k+1}, \ldots, Z_t$.
Furthermore, by the martingale central limit theorem (see, e.g., Hall and Heyde [140, Theorem
3.2]), $\sum_{i=1}^{k} a_i X_{n,i}$ converges in distribution, as $n \to \infty$, to a zero-mean normal random
variable with variance $\sum_{i=1}^{k} a_i^2$. Then, by the Cramér-Wold device, as $n \to \infty$, the vector
$X_n$ converges in distribution to $N = (N_1, \ldots, N_k)$, where $N_1, \ldots, N_k$ are independent
standard normal random variables. The rest of the proof is identical to that of Lemma 13.11
in the Appendix, except that Hoeffding’s inequality needs to be replaced by its analog for
bounded martingale differences (see Theorem A.7). ∎


8.7 Bibliographic Remarks 


The forecaster presented in Section 8.2 appears in Chung [60], which also gives an optimal 
algorithm for nonsimulatable experts (see the exercises). See Chung [61] for many related 
results. The form of Khinchine’s inequality cited here is due to Szarek [281]. The proof 
given in the Appendix is due to Littlewood [204]. The example described in Section 8.4 was 
first studied in detail by Cover [68], and then generalized substantially by Feder, Merhav, 
and Gutman [95]. 

Theorem 8.3 is a simple version of Dudley’s metric entropy bound [91]. Theorem 8.4 
follows from a result of Sudakov [280]; see Ledoux and Talagrand [192]. Understanding 
the behavior of the expected maximum of random processes, such as the Rademacher 
process characterizing the minimax regret for classes of static experts, has been an active 
topic of probability theory, and the first important results go back to Kolmogorov. Some of 
the key contributions are due to Fernique [97,98], Pisier [235], and Talagrand [287]. We 
recommend that the interested reader consult the recent beautiful book of Talagrand [288] 
for the latest advances. 

The Markov experts appearing in Section 8.6 were first considered by Feder, Merhav, 
and Gutman [95]. See also Cesa-Bianchi and Lugosi [51], where the lower bounds of 
Section 8.6 appear. We refer to [51] for more information on upper and lower bounds for 
general classes of experts. 


8.8 Exercises

8.1 Calculate V®, VO, VS, and Vs. Calculate $V_1(\mathcal{F})$, $V_2(\mathcal{F})$, and $V_3(\mathcal{F})$ when $\mathcal{F}$
contains two experts $f_{1,t} = 0$ and $f_{2,t} = 1$ for all $t$.

8.2 Prove or disprove

\[ \sup_{\mathcal{F} : |\mathcal{F}| = N} V_n(\mathcal{F}) = V_n^{(N)}. \]

8.3 Prove Lemma 8.1. Hint: Proceed with a backward induction, starting with $t = n$. Visualizing
the possible sequences of outcomes in a rooted binary tree may help.

8.4 Use Lemma 8.1 to show that if $\mathcal{F}$ is a class of static experts, then the optimal forecaster
of Section 8.3 may be written as

\[ \hat p_t(y^{t-1}) = \frac{1}{2} + \frac{1}{2}\,\mathbb{E}\Bigl[ \inf_{f \in \mathcal{F}} L_f(y^{t-1} 0\, Y^{n-t}) - \inf_{f \in \mathcal{F}} L_f(y^{t-1} 1\, Y^{n-t}) \Bigr], \]

where $Y_1, \ldots, Y_n$ are i.i.d. Bernoulli(1/2) random variables (Chung [60]).

8.5 Give an appropriate modification of the optimal prediction algorithm of Section 8.2 for the
case of general "non-simulatable" experts; that is, give a prediction algorithm that achieves the
worst-case regret $V_n^{(N)}$. (See Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth [48]
and Chung [60].) Warning: This exercise requires work.

8.6 Show that Lemma 8.1 and Theorem 8.1 are not necessarily true if the experts in $\mathcal{F}$ are not
static.

8.7 Show that for the class of experts discussed in Section 8.4

\[ \lim_{n \to \infty} \frac{V_n(\mathcal{F})}{\sqrt{n/(2\pi)}} = 1. \]

8.8 Consider the simple class of experts described in Section 8.4. Feder, Merhav, and Gutman [95]
proposed the following forecaster. Let $p_t$ denote the fraction of times outcome 1 appeared in
the sequence $y_1, \ldots, y_{t-1}$. Then the forecaster is defined by $\hat p_t = \psi_t(p_t)$, where for
$x \in [0, 1]$,

\[ \psi_t(x) = \begin{cases} 0 & \text{if } x < 1/2 - \varepsilon_t, \\ 1 & \text{if } x > 1/2 + \varepsilon_t, \\ 1/2 + (x - 1/2)/(2\varepsilon_t) & \text{otherwise,} \end{cases} \]

and $\varepsilon_t > 0$ is some positive number. Show that if $\varepsilon_t = 1/(2\sqrt{t + 2})$, then for all $n \ge 1$ and
$y^n \in \{0, 1\}^n$,

\[ \widehat{L}(y^n) - \min_{i=1,2} L_i(y^n) \le \sqrt{n + 1} + \frac{1}{2}. \]

(See [95].) Hint: Show first that among all sequences containing $n_1 < n/2$ 1's the forecaster
performs worst for the sequence which starts alternating 0's and 1's $n_1$ times and ends with
$n - 2n_1$ 0's.

8.9 After reading the previous exercise you may wonder whether the simpler forecaster defined by

\[ \psi(x) = \begin{cases} 0 & \text{if } x < 1/2, \\ 1 & \text{if } x > 1/2, \\ 1/2 & \text{if } x = 1/2 \end{cases} \]

also does the job. Show that this is not true. More precisely, show that there exists a sequence
$y^n$ such that $\widehat{L}(y^n) - \min_{i=1,2} L_i(y^n) \approx n/4$ for large $n$. (See [95].)

8.10 Let $\mathcal{F}$ be the class of all static experts of the form $f_t = p$ regardless of $t$. The class
contains all such experts with $p \in [0, 1]$. Estimate the covering numbers $N_2(\mathcal{F}, r)$ and compare
the upper and lower bounds of Theorems 8.2 and 8.4 for this case.

8.11 Use Theorem 8.4 to show that the upper bound derived for $V_n(\mathcal{F})$ in Example 8.1 is tight
up to a constant.

8.12 Consider the class $\mathcal{F}$ of all static experts that predict in a monotonic way; that is, for
each $f \in \mathcal{F}$, either $f_t \le f_{t+1}$ for all $t \ge 1$ or $f_t \ge f_{t+1}$ for all $t \ge 1$. Apply Theorem 8.2 to
conclude that

\[ V_n(\mathcal{F}) = O\bigl(\sqrt{n \log n}\bigr). \]

What do you obtain using Theorem 8.4? Can you apply Theorem 8.5 in this case?

8.13 Construct a forecaster such that for all $k = 1, 2, \ldots$ and all sequences $y_1, y_2, \ldots$,

\[ \limsup_{n \to \infty} \frac{1}{n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{M}_k} L_f(y^n) \Bigr) = 0. \]

In other words, the forecaster predicts asymptotically as well as any Markov forecaster of any
order. (The existence of such a forecaster was shown by Feder, Merhav, and Gutman [95].)
Hint: For an easy construction use the doubling trick.

8.14 Show that

\[ \liminf_{k \to \infty} \liminf_{n \to \infty} \frac{V_n(\mathcal{M}_k)}{\sqrt{2^k n \ln 2}} \ge \frac{1}{\sqrt{2}}. \]

Hint: Mimic the proof of Corollary 8.3.


9 


Logarithmic Loss 


9.1 Sequential Probability Assignment 


This chapter is entirely devoted to the investigation of a special loss function, the loga- 
rithmic loss, sometimes also called self-information loss. The reason for this distinguished 
attention is that this loss function has a meaningful interpretation in various sequential 
decision problems, including repeated gambling and data compression. These problems are 
briefly described later. Sequential prediction aimed at minimizing logarithmic loss is also 
intimately related to maximizing benefits by repeated investment in the stock market. This 
application is studied in Chapter 10. 

Now we describe the setup for the whole chapter. Let $m > 1$ be a fixed positive integer
and let the outcome space be $\mathcal{Y} = \{1, 2, \ldots, m\}$. The decision space is the probability
simplex

\[ \mathcal{D} = \Bigl\{ p = (p(1), \ldots, p(m)) : \sum_{j=1}^m p(j) = 1,\ p(j) \ge 0,\ j = 1, \ldots, m \Bigr\} \subset \mathbb{R}^m. \]


A vector $p \in \mathcal{D}$ is often interpreted as a probability distribution over the set $\mathcal{Y}$. Indeed,
in some cases, the forecaster is required to assign a probability to each possible outcome,
representing the forecaster's belief. For example, weather forecasts often take the form "the
possibility of rain is 40%."

In the entire chapter we consider the model of simulatable experts introduced in Section 2.9.
Thus, an expert $f$ is a sequence $(f_1, f_2, \ldots)$ of functions $f_t : \mathcal{Y}^{t-1} \to \mathcal{D}$, so that after
having seen the past outcomes $y^{t-1}$, expert $f$ outputs the probability vector
$f_t(\cdot \mid y^{t-1}) \in \mathcal{D}$. (If $t = 1$, $f_1 = (f_1(1), \ldots, f_1(m))$ is simply an element of $\mathcal{D}$.) For the
components of this vector we write

\[ f_t(1 \mid y^{t-1}), \ldots, f_t(m \mid y^{t-1}). \]

This notation emphasizes the analogy between an expert and a probability distribution.
Indeed, the $j$th component $f_t(j \mid y^{t-1})$ of the vector may be interpreted as the conditional
probability $f$ assigns to the $j$th element of $\mathcal{Y}$ given the past $y^{t-1}$.

Similarly, at each time instant the forecaster chooses a probability vector

\[ \hat p_t(\cdot \mid y^{t-1}) = \bigl( \hat p_t(1 \mid y^{t-1}), \ldots, \hat p_t(m \mid y^{t-1}) \bigr). \]

Once again, using the analogy with probability distributions, we may introduce, for all
$n \ge 1$ and $y^n \in \mathcal{Y}^n$, the notation

\[ f_n(y^n) = \prod_{t=1}^n f_t(y_t \mid y^{t-1}), \qquad \hat p_n(y^n) = \prod_{t=1}^n \hat p_t(y_t \mid y^{t-1}). \]

Observe that

\[ \sum_{y^n \in \mathcal{Y}^n} f_n(y^n) = \sum_{y^n \in \mathcal{Y}^n} \hat p_n(y^n) = 1 \]

and therefore expert $f$, as well as the forecaster, define probability distributions over the
set of all sequences of length $n$. Conversely, any probability distribution $\hat p_n(y^n)$ over the
set $\mathcal{Y}^n$ defines a forecaster by the induced conditional distributions

\[ \hat p_t(y_t \mid y^{t-1}) = \frac{\hat p_t(y^t)}{\hat p_{t-1}(y^{t-1})}, \]

where $\hat p_t(y^t) = \sum_{y_{t+1}^n \in \mathcal{Y}^{n-t}} \hat p_n(y^n)$.
The loss function we consider throughout this chapter is defined by

\[ \ell(p, y) = \sum_{j=1}^m \mathbb{I}_{\{y = j\}} \ln \frac{1}{p(j)} = \ln \frac{1}{p(y)}, \qquad y \in \mathcal{Y},\ p \in \mathcal{D}. \]


It is clear from the definition of the loss function that the goal of the forecaster is to assign
a large probability to the outcomes in the sequence. For an outcome sequence $y_1, \ldots, y_n$
the cumulative loss of expert $f$ and the forecaster are, respectively,

\[ L_f(y^n) = \sum_{t=1}^n \ln \frac{1}{f_t(y_t \mid y^{t-1})} \qquad \text{and} \qquad \widehat{L}(y^n) = \sum_{t=1}^n \ln \frac{1}{\hat p_t(y_t \mid y^{t-1})}. \]

The cumulative loss of an expert $f$ may also be written as $L_f(y^n) = -\ln f_n(y^n)$. (Similarly,
$\widehat{L}(y^n) = -\ln \hat p_n(y^n)$.) In other words, the cumulative loss is just the negative log likelihood
assigned to the outcome sequence by the expert $f$. Given a class $\mathcal{F}$ of experts, the difference
between the cumulative loss of the forecaster and that of the best expert, that is, the regret,
may now be written as

\[ \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) = \sum_{t=1}^n \ln \frac{1}{\hat p_t(y_t \mid y^{t-1})} - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ln \frac{1}{f_t(y_t \mid y^{t-1})} = \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{\hat p_n(y^n)}. \]

The regret may be interpreted as the logarithm of the ratio of the total probabilities that are 
sequentially assigned to the outcome sequence by the forecaster and the experts. 
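
The definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names (`cumulative_log_loss`, `constant_expert`, `regret`) and the two-expert class are hypothetical choices, not part of the text.

```python
import math

def cumulative_log_loss(predict, outcomes):
    """Sum over t of ln(1 / p_t(y_t | y^{t-1}))."""
    loss = 0.0
    for t, y in enumerate(outcomes):
        probs = predict(outcomes[:t])      # probability vector over {1, ..., m}
        loss += -math.log(probs[y - 1])    # outcomes are 1-based symbols
    return loss

def constant_expert(q):
    """Hypothetical expert that always predicts (q, 1 - q) on the alphabet {1, 2}."""
    return lambda past: (q, 1.0 - q)

def regret(forecaster, experts, outcomes):
    """Cumulative loss of the forecaster minus that of the best expert."""
    best = min(cumulative_log_loss(f, outcomes) for f in experts)
    return cumulative_log_loss(forecaster, outcomes) - best

y = [1, 1, 2, 1]
experts = [constant_expert(0.25), constant_expert(0.75)]
uniform = constant_expert(0.5)             # forecaster that always predicts (1/2, 1/2)
r = regret(uniform, experts, y)
```

Here the regret equals the logarithm of the ratio between the probability the best expert assigns to the whole sequence and the probability the forecaster assigns to it.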


Remark 9.1 (Infinite alphabets). We restrict the discussion to the case when the out- 
come space is a finite set. Note, however, that most results can be extended easily to the 
more general case when Y is a measurable space. In such cases the decision space becomes 
the set of all densities over Y with respect to some fixed common dominating measure, and 
the loss is the negative logarithm of the density evaluated at the outcome. 
We begin the study of prediction under the logarithmic loss by considering mixture
forecasters, the most natural predictors for this case. In Section 9.3 we briefly describe 
two applications, gambling and sequential data compression, in which the logarithmic loss 
function appears in a natural way. These applications have served as a main motivation for 
the large body of work done on prediction using this loss function. A special property of 
the logarithmic loss is that the minimax optimal predictor can be explicitly determined for 
any class F of experts. This is done in Section 9.4. In Sections 9.6 and 9.7 we discuss, in 
detail, the special case when F contains all constant predictors. Two versions of mixture 
forecasters are described and it is shown that their performance approximates that of 
the minimax optimal predictor. Section 9.8 describes another phenomenon specific to the 
logarithmic loss. We obtain lower bounds conceptually stronger than minimax lower bounds 
for some special yet important classes of experts. In Section 9.9 we extend the setup by 
allowing side information taking values in a finite set. In Section 9.10 a general upper 
bound for the minimax regret is derived in terms of the geometrical structure of the class 
of experts. The examples of Section 9.11 show how this general result can be applied to 
various special cases. 


9.2 Mixture Forecasters 


Recall from Section 3.3 that the logarithmic loss function is exp-concave for $\eta \le 1$ and
Theorem 3.2 applies. Thus, if the class $\mathcal{F}$ of experts is finite and $|\mathcal{F}| = N$, then the
exponentially weighted average forecaster with parameter $\eta = 1$ satisfies

\[ \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \le \ln N \]

or, equivalently,

\[ \hat p_n(y^n) \ge \frac{1}{N} \sup_{f \in \mathcal{F}} f_n(y^n). \]


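It is worth noting that the exponentially weighted average forecaster has an interesting
interpretation in this special case. Observe that the definition

\[ \hat p_t(y_t \mid y^{t-1}) = \frac{\sum_{f \in \mathcal{F}} f_t(y_t \mid y^{t-1})\, e^{-\eta L_f(y^{t-1})}}{\sum_{f \in \mathcal{F}} e^{-\eta L_f(y^{t-1})}} \]

of the exponentially weighted average forecaster, with parameter $\eta = 1$, reduces to

\[ \hat p_t(y_t \mid y^{t-1}) = \frac{\sum_{f \in \mathcal{F}} f_t(y_t \mid y^{t-1})\, f_{t-1}(y^{t-1})}{\sum_{f \in \mathcal{F}} f_{t-1}(y^{t-1})} = \frac{\sum_{f \in \mathcal{F}} f_t(y^t)}{\sum_{f \in \mathcal{F}} f_{t-1}(y^{t-1})}. \]

Thus, the total probability the forecaster assigns to a sequence $y^n$ is just

\[ \hat p_n(y^n) = \prod_{t=1}^n \hat p_t(y_t \mid y^{t-1}) = \prod_{t=1}^n \frac{\sum_{f \in \mathcal{F}} f_t(y^t)}{\sum_{f \in \mathcal{F}} f_{t-1}(y^{t-1})} = \frac{\sum_{f \in \mathcal{F}} f_n(y^n)}{N} \]

(recall that $f_0(y^0)$ is defined to be equal to 1). In other words, the probability distribution
the exponentially weighted average forecaster $\hat p$ defines over the set $\mathcal{Y}^n$ of all strings of
length $n$ is just the uniform mixture of the distributions defined by the experts. This is
why we sometimes call the exponentially weighted average forecaster the mixture forecaster.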
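
The telescoping identity can be checked numerically. The sketch below assumes a small hypothetical class of three constant experts on the alphabet {1, 2}, each identified with the probability `q` it assigns to symbol 1; the function names are illustrative only.

```python
import math

def mixture_prediction(experts, past):
    """Uniform-mixture prediction: each expert is weighted by the probability it
    assigned to the past, i.e. exp(-cumulative log loss) with eta = 1."""
    weights = []
    for q in experts:                      # expert q always predicts (q, 1 - q)
        w = 1.0
        for y in past:
            w *= q if y == 1 else 1.0 - q
        weights.append(w)
    total = sum(weights)
    p1 = sum(w * q for w, q in zip(weights, experts)) / total
    return (p1, 1.0 - p1)

def sequence_probability(experts, outcomes):
    """Product of the sequential predictions along the sequence."""
    prob = 1.0
    for t, y in enumerate(outcomes):
        prob *= mixture_prediction(experts, outcomes[:t])[y - 1]
    return prob

experts = [0.1, 0.5, 0.9]
y = [1, 2, 2, 1, 2]
joint = sequence_probability(experts, y)
expert_probs = [math.prod(q if s == 1 else 1 - q for s in y) for q in experts]
average = sum(expert_probs) / len(experts)          # uniform mixture of the experts
mixture_regret = math.log(max(expert_probs) / joint)
```

The sequentially built probability `joint` agrees with `average`, and `mixture_regret` stays below ln 3 for this class of three experts.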

Interestingly, for the logarithmic loss, with $\eta = 1$, the mixture forecaster coincides with
the greedy forecaster, and it is also an aggregating forecaster, as we show in Sections 3.4
and 3.5.

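Recalling that the aggregating forecaster may be generalized for a countable number
of experts, we may consider the following extension. Let $f^{(1)}, f^{(2)}, \ldots$ be the experts of
a countable family $\mathcal{F}$. To define the mixture forecaster, we assign a nonnegative number
$\pi_i \ge 0$ to each expert $f^{(i)} \in \mathcal{F}$ such that $\sum_{i=1}^\infty \pi_i = 1$. Then the aggregating forecaster
becomes

\[ \hat p_t(y_t \mid y^{t-1}) = \frac{\sum_{i=1}^\infty \pi_i\, f_t^{(i)}(y_t \mid y^{t-1})\, f_{t-1}^{(i)}(y^{t-1})}{\sum_{j=1}^\infty \pi_j\, f_{t-1}^{(j)}(y^{t-1})} = \frac{\sum_{i=1}^\infty \pi_i\, f_t^{(i)}(y_t \mid y^{t-1})\, e^{-L_{f^{(i)}}(y^{t-1})}}{\sum_{j=1}^\infty \pi_j\, e^{-L_{f^{(j)}}(y^{t-1})}}. \]

Now it is obvious that the joint probability the forecaster $\hat p$ assigns to each sequence $y^n$ is

\[ \hat p_n(y^n) = \sum_{i=1}^\infty \pi_i\, f_n^{(i)}(y^n). \]

Note that $\hat p_n$ indeed defines a valid probability distribution over $\mathcal{Y}^n$. Using the trivial bound
$\hat p_n(y^n) = \sum_{i=1}^\infty \pi_i f_n^{(i)}(y^n) \ge \pi_k f_n^{(k)}(y^n)$ for all $k$, we obtain, for all $y^n \in \mathcal{Y}^n$,

\[ \widehat{L}(y^n) \le \inf_{i = 1, 2, \ldots} \Bigl( L_{f^{(i)}}(y^n) + \ln \frac{1}{\pi_i} \Bigr). \]

This inequality is a special case of the "oracle inequality" derived in Section 3.5.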
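
As a quick numeric illustration of this bound, the following sketch uses made-up values for the sequence probabilities $f_n^{(i)}(y^n)$ and a hypothetical three-point prior; none of the numbers come from the text.

```python
import math

def weighted_mixture_loss(priors, expert_probs):
    """-ln of the prior-weighted mixture probability sum_i pi_i * f_n^(i)(y^n)."""
    return -math.log(sum(p * f for p, f in zip(priors, expert_probs)))

priors = [0.5, 0.25, 0.25]              # hypothetical weights pi_i, summing to 1
expert_probs = [0.001, 0.02, 0.004]     # made-up values of f_n^(i)(y^n)
loss = weighted_mixture_loss(priors, expert_probs)
# Right-hand side of the inequality: min over i of L_{f^(i)}(y^n) + ln(1 / pi_i).
oracle = min(-math.log(f) + math.log(1 / p) for p, f in zip(priors, expert_probs))
```

An expert with a small prior weight pays a larger penalty ln(1/π_i), which is the price of favoring other experts a priori.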

Because of their analogy with mixture estimators emerging in bayesian statistics, the
mixture forecaster is sometimes called bayesian mixture or bayesian model averaging, and
the "initial" weights $\pi_i$ are called prior probabilities. Because our setup is not bayesian,
we avoid this terminology.

Later in this chapter we extend the idea of a mixture forecaster to certain uncountably 
infinite classes of experts (see also Section 3.3). 


9.3 Gambling and Data Compression 


Imagine that we gamble in a horse race in which $m$ horses run repeatedly many times. In the
$t$th race we bet our entire fortune on the $m$ horses according to proportions
$\hat p_t(1), \ldots, \hat p_t(m)$, where the $\hat p_t(j)$ are nonnegative numbers with $\sum_{j=1}^m \hat p_t(j) = 1$. If
horse $j$ wins the $t$th race, we multiply our money bet on this horse by a factor of $o_t(j)$ and
we lose it otherwise. The odds $o_t(j)$ are arbitrary positive numbers. In other words, if $y_t$
denotes the index of the winning horse of the $t$th race, after the $t$th race we multiply our
capital by a factor of

\[ \sum_{j=1}^m \mathbb{I}_{\{y_t = j\}}\, \hat p_t(j)\, o_t(j) = \hat p_t(y_t)\, o_t(y_t). \]

To make it explicit that the proportions $\hat p_t(j)$ according to which we bet may depend on
the results of previous races, we write $\hat p_t(j \mid y^{t-1})$. If we start with an initial capital of $C$
units, our capital after $n$ races is

\[ C \prod_{t=1}^n \hat p_t(y_t \mid y^{t-1})\, o_t(y_t). \]

Now assume that before each race we ask the advice of a class of experts and our goal is
to win almost as much as the best of these experts. If expert $f$ divides his capital in the $t$th
race according to proportions $f_t(j \mid y^{t-1})$, $j = 1, \ldots, m$, and starts with the same initial
capital $C$, then his capital after $n$ races becomes

\[ C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t). \]

The ratio between the best expert's money and ours is thus

\[ \frac{\sup_{f \in \mathcal{F}} C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t)}{C \prod_{t=1}^n \hat p_t(y_t \mid y^{t-1})\, o_t(y_t)} = \sup_{f \in \mathcal{F}} \frac{f_n(y^n)}{\hat p_n(y^n)}, \]

independently of the odds. The logarithm of this quantity is just the difference of the
cumulative logarithmic loss of the forecaster $\hat p$ and that of the best expert described in
the previous section. In Chapter 10 we discuss in great detail a model of gambling (i.e.,
sequential investment) that is more general than the one described here. We will see that
sequential probability assignment under the logarithmic loss function is the key in the more
general model as well.
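
The cancellation of the odds is easy to see in a small simulation. The sketch below is illustrative only: horses are indexed from 0 for convenience, and the betting strategies, the random odds, and the name `final_capital` are assumptions made for the example.

```python
import math
import random

def final_capital(bets, odds, winners, capital=1.0):
    """Capital after betting everything each race; bets[t][j] is the fraction on horse j."""
    for bet, odd, w in zip(bets, odds, winners):
        capital *= bet[w] * odd[w]
    return capital

random.seed(0)
n, m = 6, 3
winners = [random.randrange(m) for _ in range(n)]
ours = [[1 / m] * m for _ in range(n)]          # forecaster: uniform bets
expert = [[0.5, 0.3, 0.2] for _ in range(n)]    # hypothetical constant expert

ratios = []
for _ in range(2):                               # two unrelated sets of odds
    odds = [[random.uniform(1.0, 5.0) for _ in range(m)] for _ in range(n)]
    ratios.append(final_capital(expert, odds, winners) / final_capital(ours, odds, winners))

# Difference of cumulative logarithmic losses (forecaster minus expert).
log_loss_gap = sum(math.log(e[w] / o[w]) for e, o, w in zip(expert, ours, winners))
```

Both entries of `ratios` coincide even though the odds differ, and their logarithm equals `log_loss_gap`.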

Another important motivation for the study of the logarithmic loss function has its roots
in information theory, more concretely in lossless source coding. Instead of describing the
sequential data compression problem in detail, we briefly mention that it is well known
that any probability distribution $f_n$ over the set $\mathcal{Y}^n$ defines a code (the so-called
Shannon–Fano code), which assigns, to each string $y^n \in \mathcal{Y}^n$, a codeword, that is, a string of
bits of length $\lambda_n(y^n) = \lceil -\log_2 f_n(y^n) \rceil$. Conversely, any code with codeword lengths $\lambda_n(y^n)$,
satisfying a natural condition (i.e., unique decodability), defines a probability distribution
by $f_n(y^n) = 2^{-\lambda_n(y^n)} / \sum_{x^n \in \mathcal{Y}^n} 2^{-\lambda_n(x^n)}$. Given a class of codes, or equivalently, a class $\mathcal{F}$
of experts, the best compression of a string $y^n$ is achieved by the code that minimizes the
length $\lceil -\log_2 f_n(y^n) \rceil$, which is approximately equivalent to minimizing the logarithmic
cumulative loss $L_f(y^n)$. Now assume that the symbols of the string $y^n$ are revealed one by
one and the goal is to compress it almost as well as the best code in the class. It turns out
that for any forecaster $\hat p_n$ (i.e., sequential probability assignment) it is possible to construct,
sequentially, a codeword of length about $-\log_2 \hat p_n(y^n)$ using a method called arithmetic
coding. The regret

\[ -\log_2 \hat p_n(y^n) - \inf_{f \in \mathcal{F}} \bigl( -\log_2 f_n(y^n) \bigr) = \frac{1}{\ln 2} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) \]

is often called the (pointwise) redundancy of the code with respect to the class of codes
$\mathcal{F}$. In this respect, the problem of sequential lossless data compression is equivalent to
sequential prediction under the logarithmic loss.
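
A minimal sketch of the correspondence between probabilities and code lengths follows. It only computes the lengths $\lceil -\log_2 f_n(y^n) \rceil$ and checks Kraft's inequality; it does not build actual codewords or implement arithmetic coding, and the constant expert with `q = 0.7` is a hypothetical choice.

```python
import math
from itertools import product

def sequence_prob(q, seq):
    """Probability a constant expert (q, 1 - q) assigns to a string over {1, 2}."""
    return math.prod(q if s == 1 else 1.0 - q for s in seq)

def code_length(prob):
    """Shannon-Fano codeword length ceil(-log2 prob), in bits."""
    return math.ceil(-math.log2(prob))

n, q = 4, 0.7
strings = list(product((1, 2), repeat=n))
lengths = {s: code_length(sequence_prob(q, s)) for s in strings}
kraft_sum = sum(2.0 ** -l for l in lengths.values())   # at most 1 for these lengths

# Redundancy (in bits) of the uniform assignment versus the expert on one string.
y = (1, 1, 2, 1)
regret_nats = math.log(sequence_prob(q, y) / 0.5 ** n)
redundancy_bits = -math.log2(0.5 ** n) + math.log2(sequence_prob(q, y))
```

The redundancy measured in bits is the logarithmic-loss regret divided by ln 2, as in the displayed identity.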

9.4 The Minimax Optimal Forecaster

In this section we investigate the minimax regret, defined in Section 2.10, for the logarithmic
loss. An important distinguishing feature of the logarithmic loss function is that the minimax
optimal forecaster can be determined explicitly. This fact facilitates the investigation of the
minimax regret, which, in turn, serves as a standard to which the performance of any
forecaster should be compared. Recall that for a given class $\mathcal{F}$ of experts, and integer
$n > 0$, the minimax regret is defined by

\[ V_n(\mathcal{F}) = \inf_{\hat p} \sup_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) = \inf_{\hat p} \sup_{y^n \in \mathcal{Y}^n} \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{\hat p_n(y^n)}. \]

If for a given forecaster $\hat p$ we define the worst-case regret by

\[ V_n(\hat p, \mathcal{F}) = \sup_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr), \]

then $V_n(\mathcal{F}) = \inf_{\hat p} V_n(\hat p, \mathcal{F})$. Interestingly, in the case of the logarithmic loss it is possible
to identify explicitly the unique forecaster achieving the minimax regret. Theorem 9.1
shows that the forecaster $p^*$ defined by the normalized maximum likelihood probability
distribution

\[ p_n^*(y^n) = \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{\sum_{x^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(x^n)} \]

has this property. Note that $p_n^*$ is indeed a probability distribution over the set $\mathcal{Y}^n$, and recall
that this probability distribution defines a forecaster by the corresponding conditional
probabilities $p_t^*(y_t \mid y^{t-1})$.


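Theorem 9.1. For any class $\mathcal{F}$ of experts and integer $n > 0$, the normalized maximum
likelihood forecaster $p^*$ is the unique forecaster such that

\[ \sup_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) = V_n(\mathcal{F}). \]

Moreover, $p^*$ is an equalizer; that is, for all $y^n \in \mathcal{Y}^n$,

\[ \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n^*(y^n)} = \ln \sum_{x^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(x^n) = V_n(\mathcal{F}). \]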


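Proof. First we show the second part of the statement. Note that by the definition of $p^*$,
its cumulative loss satisfies

\[ \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) = \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n^*(y^n)} = \ln \sum_{x^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(x^n), \]

which is independent of $y^n$, so $p^*$ is indeed an equalizer.

To show that $p^*$ is minimax optimal, let $p \ne p^*$ be an arbitrary forecaster. Then, since
$\sum_{y^n \in \mathcal{Y}^n} p_n(y^n) = \sum_{y^n \in \mathcal{Y}^n} p_n^*(y^n) = 1$, for some $y^n \in \mathcal{Y}^n$ we must have $p_n(y^n) < p_n^*(y^n)$.
But then, for this $y^n$,

\[ \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n(y^n)} > \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n^*(y^n)} = \text{const.} = V_n(p^*, \mathcal{F}) \]

by the equalizer property. Hence,

\[ V_n(p, \mathcal{F}) = \sup_{y^n \in \mathcal{Y}^n} \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n(y^n)} > V_n(p^*, \mathcal{F}), \]

which proves the theorem. □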
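
For a tiny horizon the normalized maximum likelihood distribution can be computed by brute force, which makes the equalizer property visible. The sketch below assumes a hypothetical finite class of three constant experts on {1, 2}; the function `nml` enumerates all strings and is therefore only usable for very small $n$.

```python
import math
from itertools import product

def nml(expert_prob, experts, n, alphabet=(1, 2)):
    """Normalized maximum likelihood distribution over all strings of length n.
    Brute force: enumerates len(alphabet)**n strings."""
    best = {s: max(expert_prob(f, s) for f in experts)
            for s in product(alphabet, repeat=n)}
    normalizer = sum(best.values())
    return {s: b / normalizer for s, b in best.items()}, best, math.log(normalizer)

def constant_prob(q, seq):
    """Probability the constant expert (q, 1 - q) assigns to seq."""
    return math.prod(q if s == 1 else 1.0 - q for s in seq)

experts = [0.2, 0.5, 0.8]                       # hypothetical finite class
p_star, best, minimax_regret = nml(constant_prob, experts, n=5)
regrets = [math.log(best[s] / p_star[s]) for s in p_star]
```

Every entry of `regrets` equals `minimax_regret`, the logarithm of the normalizing sum.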


In Section 2.10 we show that, under general conditions satisfied here, the maximin regret

\[ U_n(\mathcal{F}) = \sup_q \inf_{\hat p} \sum_{y^n \in \mathcal{Y}^n} q(y^n) \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{\hat p_n(y^n)} \]

equals the minimax regret $V_n(\mathcal{F})$. It is evident that the normalized maximum likelihood
forecaster $p^*$ achieves the maximin regret as well, in the sense that for any probability
distribution $q$ over $\mathcal{Y}^n$,

\[ U_n(\mathcal{F}) = \sup_q \sum_{y^n \in \mathcal{Y}^n} q(y^n) \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_n^*(y^n)}. \]

Even though we have been able to determine the minimax optimal forecaster explicitly, 
note that the practical implementation of the forecaster may be problematic. First of all, 
we determined the forecaster via the joint probabilities it assigns to all strings of length n, 
and calculation of the actual predictions $p_t^*(y_t \mid y^{t-1})$ involves sums of exponentially many
terms. 

It is important to point out that previous knowledge of the total length n of the sequence 
to be predicted is necessary to determine the minimax optimal forecaster p*. Indeed, it is 
easy to see that if the minimax optimal forecaster is determined for a certain horizon n, 
then the forecaster is not the extension of the minimax optimal forecaster for horizon n — 1, 
even for nicely structured classes of experts. (See Exercise 9.4.) 

Theorem 9.1 not only describes the minimax optimal forecaster but also gives a useful
formula for the minimax regret $V_n(\mathcal{F})$, which we study in detail in the subsequent
sections.


9.5 Examples 


Next we work out a few simple examples to understand better the behavior of the minimax
regret $V_n(\mathcal{F})$.

Finite Classes

To start with the simplest possible case, consider a finite class of experts with $|\mathcal{F}| = N$.
Then clearly,

\[ V_n(\mathcal{F}) = \ln \sum_{y^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(y^n) \le \ln \sum_{y^n \in \mathcal{Y}^n} \sum_{f \in \mathcal{F}} f_n(y^n) = \ln \sum_{f \in \mathcal{F}} \sum_{y^n \in \mathcal{Y}^n} f_n(y^n) = \ln N. \]

Of course, we already know this. In fact, the mixture forecaster described in Section 9.2
achieves the same bound. In Exercise 9.3 we point out that this upper bound cannot be
improved in the sense that there exist classes of $N$ experts such that the minimax regret
equals $\ln N$. With the notation introduced in Section 2.10, $V_n^{(N)} = \ln N$ (if $n \ge \log_2 N$).

The mixture forecaster has obvious computational advantages over the normalized max- 
imum likelihood forecaster and does not suffer from the “horizon-dependence” of the latter 
mentioned earlier. These bounds suggest that one does not lose much by using the simple 
uniform mixture forecaster instead of the optimal but horizon-dependent forecaster. We 
will see later that this fact remains true in more general settings, even for certain infinite 
classes of experts. However, as pointed out in Exercise 9.7, even if $\mathcal{F}$ is finite, in
some cases $V_n(\mathcal{F})$ may be significantly smaller than the worst-case loss achievable by any
mixture forecaster.


Constant Experts

Next we consider the class $\mathcal{F}$ of all experts such that $f_t(j \mid y^{t-1}) = f(j)$ (with $f(j) \ge 0$,
$\sum_{j=1}^m f(j) = 1$) for each $f \in \mathcal{F}$ and independently of $t$ and $y^{t-1}$. In other words, $\mathcal{F}$
contains all forecasters $f$, so that the associated probability distribution over $\mathcal{Y}^n$ is a product
distribution with identical components. Here we only consider the case when $m = 2$.
This simplifies the notation and the calculations while all the main ideas remain present.
Generalization to $m > 2$ is straightforward, and we leave the calculations as exercises.

If $m = 2$ (i.e., $\mathcal{Y} = \{1, 2\}$ and $\mathcal{D} = \{(q, 1 - q) \in \mathbb{R}^2 : q \in [0, 1]\}$), each expert in
$\mathcal{F}$ may be identified with a number $q \in [0, 1]$ representing $q = f(1)$. Thus, this expert
predicts, at each time $t$, according to the vector $(q, 1 - q) \in \mathcal{D}$, regardless of $t$ and the past
outcomes $y_1, \ldots, y_{t-1}$. We call this class the class of constant experts. Next we determine
the asymptotic value of the minimax regret $V_n(\mathcal{F})$ for this class.


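Theorem 9.2. The minimax regret $V_n(\mathcal{F})$ of the class $\mathcal{F}$ of all constant experts over the
alphabet $\mathcal{Y} = \{1, 2\}$ defined above satisfies

\[ V_n(\mathcal{F}) = \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + \varepsilon_n, \]

where $\varepsilon_n \to 0$ as $n \to \infty$.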

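Proof. Recall that by Theorem 9.1,

\[ V_n(\mathcal{F}) = \ln \sum_{y^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(y^n). \]

Now assume that the number of 1's in the sequence $y^n \in \{1, 2\}^n$ is $n_1$ and the number of
2's is $n_2 = n - n_1$. Then for the expert $f$ that predicts according to $(q, 1 - q)$, we have
$f_n(y^n) = q^{n_1} (1 - q)^{n_2}$.

Then it is easy to see, for example, by differentiating the logarithm of the above expression,
that this is maximized for $q = n_1/n$, and therefore

\[ V_n(\mathcal{F}) = \ln \sum_{y^n \in \mathcal{Y}^n} \Bigl( \frac{n_1}{n} \Bigr)^{n_1} \Bigl( \frac{n_2}{n} \Bigr)^{n_2}. \]

Since there are $\binom{n}{n_1}$ sequences containing exactly $n_1$ 1's, we have

\[ V_n(\mathcal{F}) = \ln \sum_{n_1 = 0}^{n} \binom{n}{n_1} \Bigl( \frac{n_1}{n} \Bigr)^{n_1} \Bigl( \frac{n_2}{n} \Bigr)^{n_2}. \]

We show the proof of the upper bound $V_n(\mathcal{F}) \le \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)$, whereas the similar
proof of the lower bound is left as an exercise. Recall Stirling's formula

\[ \sqrt{2\pi n} \Bigl( \frac{n}{e} \Bigr)^n e^{1/(12n+1)} \le n! \le \sqrt{2\pi n} \Bigl( \frac{n}{e} \Bigr)^n e^{1/(12n)} \]

(see, e.g., Feller [96]). Using this to approximate the binomial coefficients, each term of
the sum with $1 \le n_1 \le n - 1$ may be bounded as

\[ \binom{n}{n_1} \Bigl( \frac{n_1}{n} \Bigr)^{n_1} \Bigl( \frac{n_2}{n} \Bigr)^{n_2} \le \frac{1}{\sqrt{2\pi}} \sqrt{\frac{n}{n_1 n_2}}\; e^{1/(12n)}. \]

Thus, since the two terms with $n_1 = 0$ and $n_1 = n$ each equal 1,

\[ V_n(\mathcal{F}) \le \ln \Bigl( 2 + e^{1/(12n)} \sqrt{\frac{n}{2\pi}} \sum_{n_1 = 1}^{n-1} \frac{1}{\sqrt{n_1 n_2}} \Bigr). \]

Writing the last sum in the expression as

\[ \sum_{n_1 = 1}^{n-1} \frac{1}{\sqrt{n_1 n_2}} = \frac{1}{n} \sum_{n_1 = 1}^{n-1} \frac{1}{\sqrt{\frac{n_1}{n} \bigl( 1 - \frac{n_1}{n} \bigr)}} \]

we notice that it is just a Riemann approximation of the integral

\[ \int_0^1 \frac{1}{\sqrt{x(1 - x)}}\, dx = \pi \]

(see the exercises). This implies that $\lim_{n \to \infty} \sum_{n_1 = 1}^{n-1} 1/\sqrt{n_1 n_2} = \pi$, and so

\[ V_n(\mathcal{F}) \le \ln \Bigl( (1 + o(1)) \sqrt{\frac{n\pi}{2}} \Bigr) \]

as desired. □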
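
The exact sum over $n_1$ is cheap to evaluate, so the asymptotic formula of Theorem 9.2 can be compared with the true value of the minimax regret. The following sketch works with logarithms of the terms (via `math.lgamma`) to avoid overflow; the function names are illustrative.

```python
import math

def minimax_regret_constant_experts(n):
    """ln of the sum over n1 of C(n, n1) (n1/n)^n1 (n2/n)^n2, reading 0**0 as 1."""
    total = 0.0
    for n1 in range(n + 1):
        n2 = n - n1
        log_term = math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n2 + 1)
        if n1:
            log_term += n1 * math.log(n1 / n)
        if n2:
            log_term += n2 * math.log(n2 / n)
        total += math.exp(log_term)
    return math.log(total)

def asymptotic(n):
    """Leading terms (1/2) ln n + (1/2) ln(pi / 2) from Theorem 9.2."""
    return 0.5 * math.log(n) + 0.5 * math.log(math.pi / 2)

# The gap plays the role of epsilon_n and should shrink as n grows.
gaps = [minimax_regret_constant_experts(n) - asymptotic(n) for n in (10, 100, 1000)]
```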

Remark 9.2 (m-ary alphabet). In the general case, when $m \ge 2$ is a positive integer,
Theorem 9.2 becomes

\[ V_n(\mathcal{F}) = \frac{m - 1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1), \]

where $\Gamma$ denotes the Gamma function (see Exercise 9.8). The fact that the minimax regret
grows as a constant times $\ln n$, the constant being the half of the number of "free parameters,"
is a general phenomenon. In Section 9.9 we discuss the class of Markov experts, a
generalization of the class of constant experts discussed here, that also obeys this formula.
We study $V_n(\mathcal{F})$ in a much more general setting in Section 9.10. In particular, Corollary 9.1
in Section 9.11 establishes a result showing that, under general conditions, $V_n(\mathcal{F})$ behaves
like $\frac{k}{2} \ln n$, where $k$ is the "dimension" of class $\mathcal{F}$.


9.6 The Laplace Mixture 


The purpose of this section is to extend the idea of mixture forecasters discussed in
Section 9.2 to certain uncountably infinite classes of experts. In Section 3.3 we have
already extended the exponentially weighted average forecaster over the convex hull of
a finite set of experts for general exp-concave loss functions. Special properties of the
logarithmic loss allow us to derive sharp bounds and to bring the mixture forecaster into a
particularly simple form.

For simplicity we show the idea for the class of all constant experts introduced in
Section 9.5. Once again, we gain simplicity by considering only the case of $m = 2$.
That is, the outcome space is $\mathcal{Y} = \{1, 2\}$ and $\mathcal{D} = \{(q, 1 - q) \in \mathbb{R}^2 : q \in [0, 1]\}$, so that
each expert in $\mathcal{F}$ predicts, at each time $t$, according to the vector $(q, 1 - q) \in \mathcal{D}$ regardless
of $t$ and the past outcomes $y_1, \ldots, y_{t-1}$. Theorem 9.2 shows that
$V_n(\mathcal{F}) \approx \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2}$.

In Section 9.2 we pointed out that, for finite classes of experts, the exponentially weighted
average forecaster assigns, to each sequence $y^n$, the average of the probabilities assigned
by each expert; that is, $\hat p_n(y^n) = \frac{1}{N} \sum_{i=1}^N f_n^{(i)}(y^n)$.

This idea may be generalized in a natural way to the class of constant experts. As before,
let $n_1$ and $n_2$ denote the number of 1's and 2's in a sequence $y^n$. Then the probability
assigned to such a sequence by any expert in the class has the form $q^{n_1} (1 - q)^{n_2}$. The
Laplace mixture of these experts is defined as the forecaster that assigns, to any $y^n \in \{1, 2\}^n$,
the average of all these probabilities according to the uniform distribution over the class $\mathcal{F}$;
that is,

\[ \hat p_n(y^n) = \int_0^1 q^{n_1} (1 - q)^{n_2}\, dq. \]
After calculating this integral, it will be easy to understand the behavior of the forecaster. 


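Lemma 9.1.

\[ \int_0^1 q^{n_1} (1 - q)^{n_2}\, dq = \frac{1}{(n + 1) \binom{n}{n_1}}. \]

Proof. We may prove the equality by a backward induction with respect to $n_1$. If $n_1 = n$,
we clearly have $\int_0^1 q^n\, dq = 1/(n + 1)$. On the other hand, assuming

\[ \int_0^1 q^{n_1 + 1} (1 - q)^{n_2 - 1}\, dq = \frac{1}{(n + 1) \binom{n}{n_1 + 1}} \]

and integrating by parts, we obtain

\[ \int_0^1 q^{n_1} (1 - q)^{n_2}\, dq = \frac{n_2}{n_1 + 1} \int_0^1 q^{n_1 + 1} (1 - q)^{n_2 - 1}\, dq = \frac{n - n_1}{n_1 + 1} \cdot \frac{1}{(n + 1) \binom{n}{n_1 + 1}} = \frac{1}{(n + 1) \binom{n}{n_1}}. \]

□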

The first thing we observe is that the actual predictions of the Laplace mixture forecaster
can be calculated very easily and have a natural interpretation. Assume that the numbers of
1's and 2's in the past outcome sequence $y^{t-1}$ are $t_1$ and $t_2$. Then the probability the Laplace
forecaster assigns to the next outcome being 1 is, by Lemma 9.1,

\[ \hat p_t(1 \mid y^{t-1}) = \frac{\hat p_t(y^{t-1} 1)}{\hat p_{t-1}(y^{t-1})} = \frac{1 / \bigl( (t + 1) \binom{t}{t_1 + 1} \bigr)}{1 / \bigl( t \binom{t-1}{t_1} \bigr)} = \frac{t_1 + 1}{t + 1}. \]

Similarly, $\hat p_t(2 \mid y^{t-1}) = (t_2 + 1)/(t + 1)$. We may interpret $\hat p_t(1 \mid y^{t-1})$ as a slight
modification of the relative frequency $t_1/(t - 1)$. In fact, the Laplace forecaster may be interpreted
as a "smoothed" version of the empirical frequencies. By smoothing, one prevents infinite
losses that may occur if $t_1 = 0$ or $t_2 = 0$. All we need to analyze the cumulative loss of the
Laplace mixture is a simple property of binomial coefficients.


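Lemma 9.2. For all $1 \le k \le n$,

\[ \binom{n}{k} \le \frac{1}{\bigl( \frac{k}{n} \bigr)^k \bigl( \frac{n - k}{n} \bigr)^{n - k}}. \]

Proof. If the random variables $Y_1, \ldots, Y_n$ are drawn i.i.d. according to the distribution
$\mathbb{P}[Y_i = 1] = 1 - \mathbb{P}[Y_i = 2] = k/n$, then the probability that exactly $k$ of them equal 1
is $\binom{n}{k} \bigl( \frac{k}{n} \bigr)^k \bigl( \frac{n - k}{n} \bigr)^{n - k}$, and therefore this last expression cannot be larger than 1. □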


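Theorem 9.3. The regret of the Laplace mixture forecaster satisfies

\[ \sup_{y^n \in \{1, 2\}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) = \ln(n + 1). \]

Proof. Let $n_1$ and $n_2$ denote the number of 1's and 2's in the sequence $y^n$. We have already
observed in the proof of Theorem 9.2 that

\[ \sup_{q \in [0, 1]} q^{n_1} (1 - q)^{n_2} = \Bigl( \frac{n_1}{n} \Bigr)^{n_1} \Bigl( \frac{n_2}{n} \Bigr)^{n_2}. \]

Thus, the regret of the forecaster for such a sequence is

\[ \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) = \ln \frac{\sup_{q \in [0, 1]} q^{n_1} (1 - q)^{n_2}}{\int_0^1 q^{n_1} (1 - q)^{n_2}\, dq} = \ln \frac{\bigl( \frac{n_1}{n} \bigr)^{n_1} \bigl( \frac{n_2}{n} \bigr)^{n_2}}{1 / \bigl( (n + 1) \binom{n}{n_1} \bigr)} \quad \text{(by Lemma 9.1)} \]

\[ \le \ln(n + 1) \quad \text{(by Lemma 9.2)}. \]

Equality is achieved for the sequence $y^n = (1, 1, \ldots, 1)$. □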
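
The Laplace forecaster and the worst-case regret of Theorem 9.3 can be checked exhaustively for a short horizon. In the sketch below the function names are illustrative, and `len(past)` plays the role of $t - 1$, so the prediction $(t_1 + 1)/(t + 1)$ becomes `(t1 + 1) / (len(past) + 2)`.

```python
import math
from itertools import product

def laplace_prediction(past):
    """Probability that the next symbol is 1: (t1 + 1) / (t + 1) with t - 1 = len(past)."""
    t1 = sum(1 for y in past if y == 1)
    return (t1 + 1) / (len(past) + 2)

def laplace_regret(seq):
    """Cumulative log loss of the Laplace forecaster minus that of the best constant expert."""
    n, n1 = len(seq), sum(1 for y in seq if y == 1)
    loss = 0.0
    for t, y in enumerate(seq):
        p1 = laplace_prediction(seq[:t])
        loss -= math.log(p1 if y == 1 else 1.0 - p1)
    # The best constant expert uses q = n1 / n; a zero count contributes nothing.
    best = -sum(c * math.log(c / n) for c in (n1, n - n1) if c)
    return loss - best

n = 8
worst = max(laplace_regret(s) for s in product((1, 2), repeat=n))
```

The maximum over all $2^n$ sequences equals ln(n + 1) and is attained by the all-ones sequence.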


Thus, the extremely simple Laplace mixture forecaster achieves an excess cumulative 
loss that is of the same order of magnitude as that of the minimax optimal forecaster, 
though the leading constant is 1 instead of 1/2. In the next section we show that a slight 
modification of the forecaster achieves this optimal leading constant as well. 


Remark 9.3. Theorem 9.3 can be extended, in a straightforward way, to the general case
of alphabet size $m \ge 2$. In this case,

\[ \sup_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) = \ln \binom{n + m - 1}{m - 1} \le (m - 1) \ln(n + 1) \]

(see Exercise 9.10).


9.7 A Refined Mixture Forecaster 


With a small modification of the Laplace forecaster, we may obtain a mixture forecaster
that achieves a worst-case regret comparable to that of the minimax optimal normalized
maximum likelihood forecaster. The proof of Theorem 9.3 reveals that the Laplace mixture
achieves the largest regret for sequences containing either very few 1's or very few 2's.
Because the optimal forecaster is an equalizer (recall Theorem 9.1), a good forecaster should
attempt to achieve a nearly equal regret for all sequences. This may be done by modifying
the mixture so that it gives a slightly larger weight to those experts that predict well on these
critical sequences. The idea, first suggested by Krichevsky and Trofimov, is to use, instead
of the uniform weighting distribution, the Beta(1/2, 1/2) density $1/\bigl( \pi \sqrt{q(1 - q)} \bigr)$. (See
Section A.1.9 for some basic properties of the Beta family of densities.) Thus, we define
the Krichevsky–Trofimov mixture forecaster by

\[ \hat p_n(y^n) = \int_0^1 \frac{q^{n_1} (1 - q)^{n_2}}{\pi \sqrt{q(1 - q)}}\, dq, \]

where we use the notation of the previous section. It is easy to see, by a recursive argument
similar to the one seen in the proof of Lemma 9.1, that the predictions of the
Krichevsky–Trofimov mixture may be calculated by

\[ \hat p_t(1 \mid y^{t-1}) = \frac{t_1 + 1/2}{t}, \]

a formula very similar to that obtained in the case of the Laplace forecaster.

The performance of the forecaster is easily bounded once the following lemma is
established.

Lemma 9.3. For all nonnegative integers $n_1$, $n_2$ with $n = n_1 + n_2 \ge 1$,

\[ \int_0^1 \frac{q^{n_1} (1 - q)^{n_2}}{\pi \sqrt{q(1 - q)}}\, dq \ge \frac{1}{2\sqrt{n}} \Bigl( \frac{n_1}{n} \Bigr)^{n_1} \Bigl( \frac{n_2}{n} \Bigr)^{n_2}. \]

The proof is left as an exercise. On the basis of this lemma we immediately derive the
following performance bound.


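Theorem 9.4. The regret of the Krichevsky–Trofimov mixture forecaster satisfies

\[ \sup_{y^n \in \{1, 2\}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) \le \frac{1}{2} \ln n + \ln 2. \]

Proof.

\[ \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) = \ln \frac{\sup_{q \in [0, 1]} q^{n_1} (1 - q)^{n_2}}{\int_0^1 \frac{q^{n_1} (1 - q)^{n_2}}{\pi \sqrt{q(1 - q)}}\, dq} = \ln \frac{\bigl( \frac{n_1}{n} \bigr)^{n_1} \bigl( \frac{n_2}{n} \bigr)^{n_2}}{\int_0^1 \frac{q^{n_1} (1 - q)^{n_2}}{\pi \sqrt{q(1 - q)}}\, dq} \le \ln \bigl( 2\sqrt{n} \bigr) \quad \text{(by Lemma 9.3)} \]

as desired. □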
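
The sequential form of the Krichevsky–Trofimov forecaster makes the bound of Theorem 9.4 easy to verify exhaustively for a short horizon. This is a sketch under the binary alphabet {1, 2} of this section; the function names are illustrative.

```python
import math
from itertools import product

def kt_log_prob(seq):
    """ln of the Krichevsky-Trofimov probability, built from the predictions (t_i + 1/2) / t."""
    counts = {1: 0, 2: 0}
    log_p = 0.0
    for t, y in enumerate(seq, start=1):     # t - 1 symbols have been seen so far
        log_p += math.log((counts[y] + 0.5) / t)
        counts[y] += 1
    return log_p

def best_constant_log_prob(seq):
    """ln of the largest probability any constant expert assigns to seq (q = n1 / n)."""
    n = len(seq)
    n1 = sum(1 for y in seq if y == 1)
    return sum(c * math.log(c / n) for c in (n1, n - n1) if c)

n = 10
regrets = [best_constant_log_prob(s) - kt_log_prob(s) for s in product((1, 2), repeat=n)]
bound = 0.5 * math.log(n) + math.log(2)
```

All $2^{10}$ regrets stay below `bound`; the largest ones occur for the two constant sequences.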


Remark 9.4. The Krichevsky–Trofimov mixture estimate may be generalized to the class
of all constant experts when the outcome space is $\mathcal{Y} = \{1, \ldots, m\}$, where $m \ge 2$ is an
arbitrary integer. In this case the mixture is calculated with respect to the so-called
Dirichlet$(1/2, \ldots, 1/2)$ density

\[ \phi(p) = \frac{\Gamma(m/2)}{\Gamma(1/2)^m} \prod_{j=1}^m \frac{1}{\sqrt{p(j)}} \]

over the probability simplex $\mathcal{D}$, containing all vectors $p = (p(1), \ldots, p(m)) \in \mathbb{R}^m$ with
nonnegative components adding up to 1. As for the bound of Theorem 9.4, one may
show that the worst-case regret of the obtained forecaster

\[ \hat p_n(y^n) = \int_{\mathcal{D}} \prod_{j=1}^m p(j)^{n_j}\, \phi(p)\, dp \]

(where $n_1, \ldots, n_m$ denote the number of occurrences of each symbol in the string $y^n$) is
upper bounded by

\[ \frac{m - 1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{m - 1}{2} \ln 2 + o(1). \]

This upper bound exceeds the minimax regret by just a constant $\frac{m - 1}{2} \ln 2$. Moreover, the
forecaster may easily be calculated by the simple formula

\[ \hat p_t(i \mid y^{t-1}) = \frac{t_i + 1/2}{t - 1 + m/2}, \qquad i = 1, \ldots, m, \]

where $t_i$ denotes the number of occurrences of $i$ in $y^{t-1}$.
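
The simple prediction formula can be checked against the closed form of the Dirichlet(1/2, ..., 1/2) mixture integral, $\frac{\Gamma(m/2)}{\Gamma(1/2)^m} \cdot \frac{\prod_j \Gamma(n_j + 1/2)}{\Gamma(n + m/2)}$, which is a standard Dirichlet integral rather than a formula stated in the text. The sketch below uses hypothetical function names and an arbitrary example sequence.

```python
import math

def kt_predict(counts):
    """Dirichlet(1/2, ..., 1/2) mixture prediction from the symbol counts of the past."""
    m, seen = len(counts), sum(counts)       # seen = t - 1
    return [(c + 0.5) / (seen + m / 2) for c in counts]

def kt_sequence_log_prob(seq, m):
    """ln of the probability assigned to seq by multiplying the sequential predictions."""
    counts, log_p = [0] * m, 0.0
    for y in seq:                            # symbols are 1, ..., m
        log_p += math.log(kt_predict(counts)[y - 1])
        counts[y - 1] += 1
    return log_p, counts

seq, m = [3, 1, 3, 2, 3, 3], 3
log_p, counts = kt_sequence_log_prob(seq, m)
# Closed form of the mixture integral, evaluated with log-gamma.
closed_form = (math.lgamma(m / 2) - m * math.lgamma(0.5)
               + sum(math.lgamma(c + 0.5) for c in counts)
               - math.lgamma(len(seq) + m / 2))
```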


To understand the behavior of the Krichevsky–Trofimov forecaster, we derive a lower 
bound for its regret that holds for every sequence $y^n$ of outcomes. The following result 
shows that this mixture forecaster is indeed an approximate equalizer, since no matter what 
the sequence $y^n$ is, its regret differs from $\frac12 \ln n$ by at most a constant. (Recall that the 
minimax optimal forecaster is an exact equalizer.) 


Theorem 9.5. For all outcome sequences $y^n \in \{1, 2\}^n$, the regret of the Krichevsky–Trofimov 
mixture forecaster satisfies 

$$\hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n) = \frac{1}{2}\ln n + \Theta(1).$$


Proof. Fix an outcome sequence $y^n$. As before, let $n_1$ and $n_2$ denote the number of 
occurrences of 1's and 2's in $y^n$. The upper bound follows from Theorem 9.4, so it suffices 
to derive a lower bound for the ratio of 
$\max_{q \in [0,1]} q^{n_1}(1-q)^{n_2} = (n_1/n)^{n_1}(n_2/n)^{n_2}$ and the Krichevsky–Trofimov mixture 
probability $\hat p_n(y^n)$. To this end, observe that this probability may be expressed in terms of the 
gamma function as 

$$\hat p_n(y^n) = \frac{\Gamma\bigl(n_1 + \frac12\bigr)\,\Gamma\bigl(n_2 + \frac12\bigr)}{\pi\, n!}$$

(see Section A.1.9). Thus, 

$$\hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n) - \frac12 \ln n = \ln \frac{\pi\, n!\, n_1^{n_1} n_2^{n_2}}{\Gamma\bigl(n_1 + \frac12\bigr)\,\Gamma\bigl(n_2 + \frac12\bigr)\, n^n \sqrt{n}}.$$

In order to investigate this quantity, introduce the function 

$$F(n_1, n_2) = \frac{\pi\, (n_1 + n_2)!\; n_1^{n_1} n_2^{n_2}}{\Gamma\bigl(n_1 + \frac12\bigr)\,\Gamma\bigl(n_2 + \frac12\bigr)\,(n_1 + n_2)^{n_1 + n_2} \sqrt{n_1 + n_2}} \qquad (9.1)$$

defined for all pairs of positive integers $n_1, n_2$. A straightforward calculation, left as an 
exercise, shows that $F$ is decreasing in both of its arguments. Hence, $F(n_1, n_2) \ge F(n, n)$, 
and therefore 

$$\hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n) - \frac12 \ln n \ge \ln \frac{\pi\, (2n)!\; n^{2n}}{\Gamma\bigl(n + \frac12\bigr)^2 (2n)^{2n} \sqrt{2n}}.$$

Using Stirling's approximation $\Gamma(x) = \sqrt{2\pi}\,(x/e)^x \bigl(1/\sqrt{x}\bigr)(1 + o(1))$ as $x \to \infty$, it is easy 
to see that the right-hand side converges to a positive constant as $n \to \infty$. □ 
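
A short numerical sketch of the two facts used in this proof: the inequality $F(n_1, n_2) \ge F(n, n)$ for $n_1 + n_2 = n$, and the convergence of $F(n, n)$ to a positive constant. Applying Stirling's formula suggests that the limit is $\sqrt{\pi/2}$; this value is our own calculation and is not stated in the text. Function names are illustrative.

```python
import math

def log_F(n1, n2):
    """ln F(n1, n2) for positive integers n1, n2, following the definition in (9.1)."""
    n = n1 + n2
    return (math.log(math.pi) + math.lgamma(n + 1)
            + n1 * math.log(n1) + n2 * math.log(n2)
            - math.lgamma(n1 + 0.5) - math.lgamma(n2 + 0.5)
            - (n + 0.5) * math.log(n))

def min_log_F_over_types(n):
    """Smallest value of ln F(j, n - j) over all types with both symbols present."""
    return min(log_F(j, n - j) for j in range(1, n))

if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, math.exp(min_log_F_over_types(n)), math.exp(log_F(n, n)))
    print("sqrt(pi/2) =", math.sqrt(math.pi / 2))
```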
9.8 Lower Bounds for Most Sequences 


In Section 9.4 we determined the minimax regret $V_n(\mathcal F)$ exactly for an arbitrary class $\mathcal F$ of 
experts. The definition of $V_n(\mathcal F)$ implies that for any forecaster $\hat p_n$ there exists a sequence 
$y^n \in \mathcal Y^n$ such that the regret $\hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n)$ is at least as large as $V_n(\mathcal F)$. In this 
section we point out that in some cases one may obtain much stronger lower bounds. In 
fact, we show that for some classes $\mathcal F$, no matter what the forecaster $\hat p_n$ is, the regret cannot 
be much smaller than $V_n(\mathcal F)$ for “most” sequences of outcomes $y^n$. This indicates that the 
minimax value is achieved not only for an exceptional unfortunate sequence of outcomes, 
but in fact for “most” of them. (Of course, since the minimax optimal forecaster is an 
equalizer, we already knew this for $p^*_n$. The result shown in this section indicates that all 
forecasters share this property.) What we mean by “most” will be made clear later. We just 
note here that the word “most” does not directly refer to cardinality. 

In this section, just like in the previous ones, we focus on the special case of binary 
alphabet Y = {1, 2} and the class F of constant experts. This is the simplest case for which 
the basic ideas may be seen in a transparent way. The result obtained for this simple class 
may be generalized to more complex cases such as constant experts over an m-ary alphabet, 
Markov experts, and classes of experts based on finite state machines. Some of these cases 
are left to the reader as exercises. 

Thus, the class $\mathcal F$ we consider contains all probability distributions on $\{1, 2\}^n$ that assign, 
to any sequence $y^n \in \{1, 2\}^n$, probability $q^j (1-q)^{n-j}$, where $j$ is the number of 1's in 
the sequence and $q \in [0, 1]$ is the parameter determining the expert. Recall that for such a 
sequence the best expert assigns probability 

$$\max_{q \in [0,1]} q^j (1-q)^{n-j} = \left(\frac{j}{n}\right)^j \left(\frac{n-j}{n}\right)^{n-j}$$

and therefore $\inf_{f \in \mathcal F} L_f(y^n) = -j \ln\frac{j}{n} - (n-j)\ln\frac{n-j}{n}$. 
Next we formulate the result. To this end, we partition the set $\{1, 2\}^n$ into $n + 1$ classes 
of types according to the number of 1's contained in the sequence. To this purpose, define 
the sets 

$$T_j = \bigl\{ y^n \in \{1,2\}^n : \text{the number of 1's in } y^n \text{ is exactly } j \bigr\}, \qquad j = 0, 1, \ldots, n.$$

In Theorem 9.6 think of $\delta_n$ as a small positive number, and $\varepsilon_n \ll \delta_n$ even smaller but 
still not too small. For example, one may consider $\delta_n \sim 1/\ln\ln n$ and $\varepsilon_n \sim 1/\ln n$ or 
$\delta_n \sim 1/\sqrt{\ln\ln n}$ and $\varepsilon_n \sim 1/\ln\ln n$ to get a meaningful result. 


Theorem 9.6. Consider the class $\mathcal F$ of constant experts over the binary alphabet $\mathcal Y = \{1, 2\}$. 
Let $\varepsilon_n$ be a positive number, and consider an arbitrary forecaster $\hat p_n$. Define the set 

$$A = \left\{ y^n \in \{1,2\}^n : \hat L(y^n) \le \inf_{f \in \mathcal F} L_f(y^n) + \frac12 \ln n - \ln\frac{C}{\varepsilon_n} \right\},$$

where $C = \sqrt{\pi}\, e^{1/6} / \sqrt{2} \approx 1.4806$. Then, for any $\delta_n > 0$, 

$$\frac{1}{n} \left| \left\{ j : \frac{|A \cap T_j|}{|T_j|} > \delta_n \right\} \right| \le \frac{\varepsilon_n}{\delta_n}.$$


If $\varepsilon_n$ is small, the set $A$ contains all sequences for which the regret of the forecaster 
$\hat p_n$ is significantly smaller than the minimax value $V_n(\mathcal F) \approx \frac12 \ln n$ (see Theorem 9.2). 
Theorem 9.6 states that if $\varepsilon_n \ll \delta_n$, the vast majority of types $T_j$ are such that the proportion 
of sequences in $T_j$ falling in $A$ is smaller than $\delta_n$. 


Remark 9.5. The interpretation we have given to Theorem 9.6 is that for any forecaster, the 
regret cannot be significantly smaller than the minimax value for “most” sequences. What 
the theorem really means is that for the majority of classes $T_j$ the subset of sequences of $T_j$ 
for which the regret is small is tiny. However, the theorem does not imply that there exist few 
sequences in $\{1, 2\}^n$ for which the regret is small. Just note that a relatively small number 
of classes of types (say, those with $j$ between $n/2 - 10\sqrt{n}$ and $n/2 + 10\sqrt{n}$) contain the 
vast majority of sequences. Indeed, the forecaster that assigns the uniform probability $2^{-n}$ 
to all sequences will work reasonably well for a huge number of sequences. This example 
shows that the notion of “most” considered in Theorem 9.6 may be more adequate than just 
counting sequences. 
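
The remark can be made concrete with a small experiment. The sketch below assumes a forecaster whose probability depends on $y^n$ only through its type, so that $|A \cap T_j|/|T_j|$ is either 0 or 1 and the set $A$ can be described by listing types. It does this for the uniform forecaster and for the Krichevsky–Trofimov mixture; the function names are illustrative.

```python
import math

C = math.sqrt(math.pi / 2) * math.exp(1 / 6)  # the constant of Theorem 9.6, about 1.4806

def best_expert_loss(j, n):
    """inf_f L_f(y^n) for a sequence with j ones (convention 0 ln 0 = 0)."""
    return -sum(c * math.log(c / n) for c in (j, n - j) if c > 0)

def uniform_loss(j, n):
    """Loss of the forecaster assigning probability 2^{-n} to every sequence."""
    return n * math.log(2)

def kt_loss(j, n):
    """Loss of the Krichevsky-Trofimov mixture on a sequence with j ones."""
    return -(math.lgamma(j + 0.5) + math.lgamma(n - j + 0.5)
             - math.log(math.pi) - math.lgamma(n + 1))

def types_inside_A(forecaster_loss, n, eps):
    """Types j whose sequences all satisfy the defining inequality of the set A."""
    threshold = 0.5 * math.log(n) - math.log(C / eps)
    return [j for j in range(n + 1)
            if forecaster_loss(j, n) - best_expert_loss(j, n) <= threshold]

if __name__ == "__main__":
    n, eps = 1000, 0.2
    print("uniform:", len(types_inside_A(uniform_loss, n, eps)), "types out of", n + 1)
    print("KT:     ", len(types_inside_A(kt_loss, n, eps)), "types out of", n + 1)
```

For the uniform forecaster only a narrow band of types around $j = n/2$ falls in $A$, yet this band holds most sequences, which is the point of the remark.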


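Proof. First observe that the logarithm of the cardinality of each class $T_j$ may be bounded, 
using Stirling's formula, by 

$$\ln |T_j| = \ln \binom{n}{j} \ge \ln\left[ \left(\frac{n}{j}\right)^j \left(\frac{n}{n-j}\right)^{n-j} \frac{\sqrt{n}}{\sqrt{2\pi j (n-j)}\; e^{1/6}} \right] \ge \ln\left[ \left(\frac{n}{j}\right)^j \left(\frac{n}{n-j}\right)^{n-j} \right] - \frac12 \ln n - \ln C,$$

where at the last step we used the inequality $\sqrt{j(n-j)} \le n/2$. Therefore, for any sequence 
$y^n \in T_j$, we have 

$$\inf_{f \in \mathcal F} L_f(y^n) \le \ln |T_j| + \frac12 \ln n + \ln C.$$

This implies that if $y^n \in A \cap T_j$, then 

$$\hat L(y^n) \le \ln |T_j| + \ln n - \ln\frac{1}{\varepsilon_n}$$

or, in other words, 

$$\hat p_n(y^n) \ge \frac{1}{|T_j|\, n\, \varepsilon_n}.$$

To finish the proof, note that 

$$\left| \left\{ j : \frac{|A \cap T_j|}{|T_j|} > \delta_n \right\} \right| \le \frac{1}{\delta_n} \sum_{j=0}^n \frac{|A \cap T_j|}{|T_j|} = \frac{1}{\delta_n} \sum_{y^n \in A} \frac{1}{|T(y^n)|}$$

(where $T(y^n)$ denotes the set $T_j$ containing $y^n$) 

$$\le \frac{n \varepsilon_n}{\delta_n} \sum_{y^n \in A} \hat p_n(y^n) \le \frac{n \varepsilon_n}{\delta_n}.$$

□ 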


9.9 Prediction with Side Information 


The purpose of this section is to extend the framework of prediction to the case when the forecaster 
has access to certain “side information.” At each time instant, before making a prediction, 
a side information symbol is revealed to the forecaster. In our setup this side information 
is completely arbitrary: it may contain any external information and it may even depend 
on the sequence to be predicted. In this section we restrict our attention to the case when 
side information comes from a finite set. In Chapter 11 we develop a general framework of 
prediction with side information when the side information is a finite-dimensional vector, 
and the class of experts contains linear functions of the side information. The formal setup 
is the following. 

We consider prediction of sequences taking values in a finite alphabet $\mathcal Y = \{1, \ldots, m\}$. 
Let $K$ be a positive integer, and let $\mathcal G_1, \ldots, \mathcal G_K$ be “base” classes of static forecasters. (We 
assume the static property to lighten notation; the definitions and the results that follow can 
be generalized easily.) The class $\mathcal G_j$ contains forecasters of the form 

$$g^{(j)}(y^n) = \prod_{t=1}^n g^{(j)}_t(y_t),$$

where for each $t = 1, \ldots, n$ the vector $\bigl(g^{(j)}_t(1), \ldots, g^{(j)}_t(m)\bigr)$ is an element of the probability 
simplex $\mathcal D$ in $\mathbb R^m$. At each time $t$, a side information symbol $z_t \in \mathcal Z = \{1, \ldots, K\}$ becomes 
available to the forecaster. The class $\mathcal F$ of forecasters against which our predictor competes 
contains all forecasters $f$ of the form 

$$f_t(y \mid y^{t-1}, z^t) = f_t(y \mid z_t) = g^{(z_t)}_{t_{z_t}}(y),$$

where for each $j$, $t_j$ is the length of the sequence of times $s < t$ such that $z_s = j$. In other 
words, each $f$ ignores the past and uses the side information symbol $z_t$ to pick a class $\mathcal G_{z_t}$. 
In this class a forecaster is determined on the basis of the subsequence of the past defined 
by the time instances when the side information coincided with the actual side information 
$z_t$. Note that in some sense $f$ is static, but $z_t$ may depend in an arbitrary manner on the 
sequence of past (or even future) outcomes. 

The loss of $f \in \mathcal F$ for a given sequence of outcomes $y^n \in \mathcal Y^n$ and side information 
$z^n \in \mathcal Z^n$ is 

$$-\sum_{t=1}^n \ln f_t(y_t \mid y^{t-1}, z^t) = -\sum_{t=1}^n \ln g^{(z_t)}_{t_{z_t}}(y_t).$$

Our goal is to define forecasters $\hat p_t$ whose cumulative loss 

$$-\sum_{t=1}^n \ln \hat p_t(y_t \mid y^{t-1}, z^t)$$

is close to that of the best expert, $\inf_{f \in \mathcal F} -\sum_{t=1}^n \ln f_t(y_t \mid y^{t-1}, z^t)$, for all outcome sequences 
$y^n \in \mathcal Y^n$ and side-information sequences $z^n \in \mathcal Z^n$. Assume that for each static expert class 
$\mathcal G_j$ we have a forecaster $q^{(j)}$ with worst-case regret 

$$V_n(q^{(j)}, \mathcal G_j) = \sup_{y^n \in \mathcal Y^n}\; \sup_{g^{(j)} \in \mathcal G_j}\; \sum_{t=1}^n \ln \frac{g^{(j)}_t(y_t)}{q^{(j)}_t(y_t \mid y^{t-1})}.$$

On the basis of these “elementary” forecasters, we may define, in a natural way, the 
following forecaster $\hat p$ with side information 

$$\hat p_t(y \mid y^{t-1}, z^t) = q^{(z_t)}_{t_{z_t}}\bigl(y \mid y^t_{z_t}\bigr),$$

where for each $j$, $y^t_j$ denotes the sequence of $y_s$'s ($s < t$) such that $z_s = j$. In other words, 
the forecaster $\hat p$ looks back at the past sequence $y_1, \ldots, y_{t-1}$ and considers only those 
time instants at which the side-information symbol agreed with the current symbol $z_t$. The 
prediction of $\hat p$ is just that of $q^{(z_t)}$ based on these past symbols. The performance of $\hat p$ may 
be bounded easily as follows. 
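
A minimal sketch of this construction, assuming binary outcomes in {1, 2} and taking the Krichevsky–Trofimov rule as every elementary forecaster. The class names `KTPredictor` and `SideInfoForecaster` and the function `cumulative_log_loss` are illustrative and do not come from the text.

```python
import math
from collections import defaultdict

class KTPredictor:
    """Binary Krichevsky-Trofimov forecaster over the symbols {1, 2}."""
    def __init__(self):
        self.counts = {1: 0, 2: 0}

    def predict(self, y):
        """Probability assigned to symbol y given the symbols seen so far."""
        total = self.counts[1] + self.counts[2]
        return (self.counts[y] + 0.5) / (total + 1.0)

    def update(self, y):
        self.counts[y] += 1

class SideInfoForecaster:
    """Keeps one elementary forecaster per side-information symbol."""
    def __init__(self, make_base=KTPredictor):
        self.base = defaultdict(make_base)

    def predict(self, y, z):
        # Only the past outcomes observed under the same symbol z are used.
        return self.base[z].predict(y)

    def update(self, y, z):
        self.base[z].update(y)

def cumulative_log_loss(ys, zs):
    """Cumulative logarithmic loss of the side-information forecaster."""
    forecaster = SideInfoForecaster()
    loss = 0.0
    for y, z in zip(ys, zs):
        loss -= math.log(forecaster.predict(y, z))
        forecaster.update(y, z)
    return loss

if __name__ == "__main__":
    ys = [1, 2, 1, 1, 2, 2, 1, 2]
    zs = [1, 1, 2, 2, 1, 2, 1, 2]
    print(cumulative_log_loss(ys, zs))
```

Because each elementary forecaster only ever sees its own subsequence, the total loss is the sum of the losses the elementary forecasters would incur on those subsequences run in isolation.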


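Theorem 9.7. For any outcome sequence $y^n \in \mathcal Y^n$ and side information sequence $z^n \in \mathcal Z^n$, 
the regret of forecaster $\hat p$ with respect to all forecasters in class $\mathcal F$ satisfies 

$$\sup_{f \in \mathcal F} \sum_{t=1}^n \ln \frac{f_t(y_t \mid y^{t-1}, z^t)}{\hat p_t(y_t \mid y^{t-1}, z^t)} \le \sum_{j=1}^K V_{n_j}\bigl(q^{(j)}, \mathcal G_j\bigr),$$

where $n_j = \sum_{t=1}^n \mathbb I_{\{z_t = j\}}$ is the number of occurrences of symbol $j$ in the side-information 
sequence. 


Proof. 

$$\sup_{f \in \mathcal F} \sum_{t=1}^n \ln \frac{f_t(y_t \mid y^{t-1}, z^t)}{\hat p_t(y_t \mid y^{t-1}, z^t)} = \sup_{f \in \mathcal F} \sum_{t=1}^n \ln \frac{g^{(z_t)}_{t_{z_t}}(y_t)}{q^{(z_t)}_{t_{z_t}}\bigl(y_t \mid y^t_{z_t}\bigr)} \quad \text{(by definition of } \hat p\text{)}$$

$$= \sup_{f \in \mathcal F} \sum_{j=1}^K \left( \sum_{t=1}^n \mathbb I_{\{z_t = j\}} \ln \frac{g^{(j)}_{t_j}(y_t)}{q^{(j)}_{t_j}\bigl(y_t \mid y^t_j\bigr)} \right)$$

$$= \sum_{j=1}^K \sup_{g^{(j)} \in \mathcal G_j} \left( \sum_{t=1}^n \mathbb I_{\{z_t = j\}} \ln \frac{g^{(j)}_{t_j}(y_t)}{q^{(j)}_{t_j}\bigl(y_t \mid y^t_j\bigr)} \right)$$

(since the expression within the parentheses depends only on $g^{(j)}$) 

$$\le \sum_{j=1}^K V_{n_j}\bigl(q^{(j)}, \mathcal G_j\bigr). \qquad \square$$
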


The next simple example sheds some light on the power of this simple result. 


Example 9.1 (Markov forecasters). Let $k \ge 1$ and consider the class $\mathcal M_k$ of all $k$th order 
stationary Markov experts defined over the binary alphabet $\mathcal Y = \{1, 2\}$. More precisely, 
$\mathcal M_k$ contains all forecasters $f$ for which the prediction at time $t$ is a function of the last 
$k$ outcomes $(y_{t-k}, \ldots, y_{t-1})$ (independent of $t$ and outcomes $y_s$ for $s < t - k$). In other 
words, for a Markov forecaster one may write $f_t(y \mid y^{t-1}) = f(y \mid y^{t-1}_{t-k})$. Such forecasters 
are also considered in Section 8.6 for different loss functions. As each prediction of a $k$th 
order Markov expert is determined by the last $k$ bits observed, we add a prefix $y_{-k+1}, \ldots, y_0$ 
to the sequence to predict in the same way as we did in Section 8.6. 
To obtain an upper bound for the minimax regret $V_n(\mathcal M_k)$, we may use Theorem 9.7 
in a simple way. The side information $z_t$ is now defined as $(y_{t-k}, \ldots, y_{t-1})$, that is, $z_t$ 
takes one of $K = 2^k$ values. If we define $\mathcal G_1 = \cdots = \mathcal G_K$ as the class $\mathcal G$ of all constant 
experts over the alphabet $\mathcal Y = \{1, 2\}$ defined in the previous sections, then it is easy to 
see that the class $\mathcal F$ of all forecasters using the side information $(y_{t-k}, \ldots, y_{t-1})$ is just class 
$\mathcal M_k$. Let $q^{(1)} = \cdots = q^{(K)}$ be the Krichevsky–Trofimov forecaster for class $\mathcal G$, and define 
the forecaster $\hat p$ as in Theorem 9.7. Then, according to Theorem 9.7, for any sequence of 
outcomes $y^n_{-k+1}$, the loss of $\hat p$ may be bounded as 

$$\hat L(y^n) - \inf_{f \in \mathcal M_k} L_f(y^n) \le \sum_{j=1}^{2^k} V_{n_j}\bigl(q^{(j)}, \mathcal G\bigr).$$

According to Theorem 9.4, 

$$V_n\bigl(q^{(j)}, \mathcal G\bigr) \le \frac12 \ln n + \ln 2.$$

Using this bound, we obtain 

$$\hat L(y^n) - \inf_{f \in \mathcal M_k} L_f(y^n) \le \sum_{j=1}^{2^k} \frac12 \ln n_j + 2^k \ln 2 = \frac12 \ln \prod_{j=1}^{2^k} n_j + 2^k \ln 2$$

$$\le \frac{2^k}{2} \ln\left( \frac{1}{2^k} \sum_{j=1}^{2^k} n_j \right) + 2^k \ln 2 = \frac{2^k}{2} \ln\frac{n}{2^k} + 2^k \ln 2,$$

where we used the arithmetic-geometric mean inequality. 
The upper bound obtained this way can be shown to be quite sharp. Indeed, the minimax 
regret $V_n(\mathcal M_k)$ may be seen to behave like $2^{k-1} \ln(n / 2^k)$ (see Exercise 9.13). 
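
The example can be tried out directly. The following sketch runs a separate Krichevsky–Trofimov estimator in each length-$k$ context and compares the resulting regret with the bound $\frac{2^k}{2}\ln\frac{n}{2^k} + 2^k \ln 2$. It assumes $k \ge 1$ and $n \ge 2^k$; the function names and the test sequence are hypothetical.

```python
import math
from collections import defaultdict

def markov_kt_log_loss(seq, prefix):
    """Log loss of KT run separately in each context of length k = len(prefix), k >= 1.

    Returns the cumulative loss together with the per-context symbol counts.
    """
    k = len(prefix)
    counts = defaultdict(lambda: [0, 0])   # context -> [number of 1's, number of 2's]
    history = list(prefix)
    loss = 0.0
    for y in seq:
        c = counts[tuple(history[-k:])]
        loss -= math.log((c[y - 1] + 0.5) / (c[0] + c[1] + 1.0))
        c[y - 1] += 1
        history.append(y)
    return loss, counts

def best_markov_log_loss(counts):
    """Loss of the best kth-order Markov expert in hindsight (empirical frequencies)."""
    loss = 0.0
    for c1, c2 in counts.values():
        for c in (c1, c2):
            if c > 0:
                loss -= c * math.log(c / (c1 + c2))
    return loss

def markov_regret_bound(n, k):
    """The bound (2^k / 2) ln(n / 2^k) + 2^k ln 2 of Example 9.1."""
    return 2 ** (k - 1) * math.log(n / 2 ** k) + 2 ** k * math.log(2)

if __name__ == "__main__":
    seq = [1 + ((i * i + i // 3) % 2) for i in range(400)]
    loss, counts = markov_kt_log_loss(seq, prefix=[1, 1])
    print(loss - best_markov_log_loss(counts), markov_regret_bound(len(seq), 2))
```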


9.10 A General Upper Bound 


Next we investigate the minimax regret $V_n(\mathcal F)$ for general classes of experts. We derive a 
general bound that shows how the “size” of the class $\mathcal F$ affects the cumulative regret. 
To any class $\mathcal F$ of experts, we associate the metric $d$ defined as 

$$d(f, g) = \sqrt{ \sum_{t=1}^n \sup_{y^t} \bigl( \ln f_t(y_t \mid y^{t-1}) - \ln g_t(y_t \mid y^{t-1}) \bigr)^2 }.$$

Denote by $N(\mathcal F, \varepsilon)$ the $\varepsilon$-covering number of $\mathcal F$ under the metric $d$. Recall that for any 
$\varepsilon > 0$, the $\varepsilon$-covering number is the cardinality of the smallest subset $\mathcal F' \subset \mathcal F$ such that for 
all $f \in \mathcal F$ there exists a $g \in \mathcal F'$ such that $d(f, g) \le \varepsilon$. The main result of this section is the 
following upper bound. 


Theorem 9.8. For any class $\mathcal F$ of experts, 

$$V_n(\mathcal F) \le \inf_{\varepsilon > 0} \left( \ln N(\mathcal F, \varepsilon) + 24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal F, \delta)}\; d\delta \right).$$


Note that if $\mathcal F$ is a finite class, then the right-hand side converges to $\ln |\mathcal F|$ as $\varepsilon \to 0$, 
and therefore we recover our earlier general bound. However, even for finite classes, the 
right-hand side may be significantly smaller than $\ln |\mathcal F|$. Also, this result allows us to derive 
upper bounds for very general classes of experts. 

As a first step in the proof of Theorem 9.8, we obtain a weak bound for $V_n(\mathcal F)$. This will 
later be refined to prove the stronger bound of Theorem 9.8. 


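Lemma 9.4. For any class $\mathcal F$ of experts, 

$$V_n(\mathcal F) \le 24 \int_0^{D/2} \sqrt{\ln N(\mathcal F, \varepsilon)}\; d\varepsilon,$$

where $D = \sup_{f, g \in \mathcal F} d(f, g)$ is the diameter of $\mathcal F$. 


Proof. Recall that, by the equalizer property of the normalized maximum likelihood forecaster 
$p^*$ established in Theorem 9.1, for all $y^n \in \mathcal Y^n$, $V_n(\mathcal F) = \sup_{f \in \mathcal F} \ln \frac{f_n(y^n)}{p^*_n(y^n)}$. Because 
the right-hand side is a constant function of $y^n$, we may take any weighted average of it 
without changing its value. The trick is to weight the average according to the probability 
distribution defined by $p^*$. Thus, we may write 

$$V_n(\mathcal F) = \sum_{y^n \in \mathcal Y^n} \left( \sup_{f \in \mathcal F} \ln \frac{f_n(y^n)}{p^*_n(y^n)} \right) p^*_n(y^n).$$

If we introduce a vector $Y^n = (Y_1, \ldots, Y_n)$ of random variables distributed according to 
$p^*_n$, we obtain 

$$V_n(\mathcal F) = \mathbb E\left[ \sup_{f \in \mathcal F} \ln \frac{f_n(Y^n)}{p^*_n(Y^n)} \right] = \mathbb E\left[ \sup_{f \in \mathcal F} \sum_{t=1}^n \ln \frac{f_t(Y_t \mid Y^{t-1})}{p^*_t(Y_t \mid Y^{t-1})} \right]$$

$$\le \mathbb E\left[ \sup_{f \in \mathcal F} \sum_{t=1}^n \left( \ln \frac{f_t(Y_t \mid Y^{t-1})}{p^*_t(Y_t \mid Y^{t-1})} - \mathbb E\left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{p^*_t(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} \right] \right) \right],$$

where the last step follows from the nonnegativity of the Kullback–Leibler divergence of 
the conditional densities, that is, from the fact that 

$$\mathbb E\left[ \ln \frac{p^*_t(Y_t \mid Y^{t-1} = y^{t-1})}{f_t(Y_t \mid Y^{t-1} = y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \ge 0$$

(see Section A.2). Now, for each $f \in \mathcal F$ let 

$$T_f(y^n) = \frac12 \sum_{t=1}^n \left( \ln \frac{f_t(y_t \mid y^{t-1})}{p^*_t(y_t \mid y^{t-1})} - \mathbb E\left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{p^*_t(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \right)$$

so we have $V_n(\mathcal F) \le 2\, \mathbb E\bigl[ \sup_{f \in \mathcal F} T_f \bigr]$, where we write $T_f = T_f(Y^n)$. 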
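To obtain a suitable upper bound for this quantity, we apply Theorem 8.3. To do this, 
we need to show that the process $\{T_f : f \in \mathcal F\}$ is indeed a subgaussian family under 
the metric $d$. (Sample continuity of the process is obvious.) To this end, note that for any 
$f, g \in \mathcal F$, 

$$T_f(y^n) - T_g(y^n) = \sum_{t=1}^n Z_t(y^t),$$

where 

$$Z_t(y^t) = \frac12 \left( \ln \frac{f_t(y_t \mid y^{t-1})}{g_t(y_t \mid y^{t-1})} - \mathbb E\left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{g_t(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \right).$$

Now it is easy to see that $T_f - T_g = T_f(Y^n) - T_g(Y^n)$ is a sum of bounded martingale 
differences with respect to the sequence $Y_1, Y_2, \ldots, Y_n$; that is, each term $Z_t$ has zero 
conditional mean and range bounded by $2 d_t(f, g)$, where 
$d_t(f, g) = \sup_{y^t} \bigl| \ln f_t(y_t \mid y^{t-1}) - \ln g_t(y_t \mid y^{t-1}) \bigr|$, so that $d(f, g)^2 = \sum_{t=1}^n d_t(f, g)^2$. 
Then Lemma A.6 implies that, for all $\lambda > 0$, 

$$\mathbb E\left[ e^{\lambda (T_f - T_g)} \right] \le \exp\left( \frac{\lambda^2}{2}\, d(f, g)^2 \right).$$

Thus, the family $\{T_f : f \in \mathcal F\}$ is indeed subgaussian. Hence, using $V_n(\mathcal F) \le 
2\, \mathbb E\bigl[ \sup_{f \in \mathcal F} T_f \bigr]$ and applying Theorem 8.3 we obtain the statement of the lemma. □ 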


Proof of Theorem 9.8. To prove the main inequality, we partition F into small subclasses 
and calculate the minimax forecaster for each subclass. Lemma 9.4 is then applied in each 
subclass. Finally, the optimal forecasters for these subclasses are combined by a simple 
finite mixture. 

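Fix an arbitrary $\varepsilon > 0$ and let $\mathcal G = \{g_1, \ldots, g_N\}$ be an $\varepsilon$-covering of $\mathcal F$ of minimal size 
$N = N(\mathcal F, \varepsilon)$. Determine the subsets $\mathcal F_1, \ldots, \mathcal F_N$ of $\mathcal F$ by 

$$\mathcal F_i = \bigl\{ f \in \mathcal F : d(f, g_i) \le d(f, g_j) \text{ for all } j = 1, \ldots, N \bigr\};$$

that is, $\mathcal F_i$ contains all experts that are closest to $g_i$ in the covering. Clearly, the union 
of $\mathcal F_1, \ldots, \mathcal F_N$ is $\mathcal F$. For each $i = 1, \ldots, N$, let $g^{(i)}$ denote the normalized maximum 
likelihood forecaster for $\mathcal F_i$: 

$$g^{(i)}_n(y^n) = \frac{\sup_{f \in \mathcal F_i} f_n(y^n)}{\sum_{x^n \in \mathcal Y^n} \sup_{f \in \mathcal F_i} f_n(x^n)}.$$

Now let the forecaster $p_\varepsilon$ be the uniform mixture of “experts” $g^{(1)}, \ldots, g^{(N)}$. Clearly, 
$V_n(\mathcal F) \le \inf_{\varepsilon > 0} V_n(p_\varepsilon, \mathcal F)$. Thus, all we have to do is to bound the regret of $p_\varepsilon$. To this end, 
fix any $y^n \in \mathcal Y^n$ and let $k = k(y^n)$ be the index of the subset $\mathcal F_k$ containing the best expert 
for sequence $y^n$; that is, 

$$\ln \sup_{f \in \mathcal F} f(y^n) = \ln \sup_{f \in \mathcal F_k} f(y^n).$$

Then 

$$\ln \frac{\sup_{f \in \mathcal F} f(y^n)}{p_\varepsilon(y^n)} = \ln \frac{g^{(k)}(y^n)}{p_\varepsilon(y^n)} + \ln \frac{\sup_{f \in \mathcal F_k} f(y^n)}{g^{(k)}(y^n)}.$$

On the one hand, by the upper bound for the loss of the mixture forecaster, 

$$\sup_{y^n} \ln \frac{g^{(k)}(y^n)}{p_\varepsilon(y^n)} \le \ln N.$$

On the other hand, 

$$\sup_{y^n} \ln \frac{\sup_{f \in \mathcal F_k} f(y^n)}{g^{(k)}(y^n)} \le \max_{i = 1, \ldots, N}\, \sup_{y^n} \ln \frac{\sup_{f \in \mathcal F_i} f(y^n)}{g^{(i)}(y^n)} = \max_{i = 1, \ldots, N} V_n(\mathcal F_i).$$

Hence, we get 

$$V_n(p_\varepsilon, \mathcal F) \le \ln N + \max_{i = 1, \ldots, N} V_n(\mathcal F_i).$$

Now note that the diameter of each element of the partition $\mathcal F_1, \ldots, \mathcal F_N$ is at most $2\varepsilon$. Hence, 
applying Lemma 9.4 to each $\mathcal F_i$, we find that 

$$V_n(p_\varepsilon, \mathcal F) \le \ln N + 24 \max_{i = 1, \ldots, N} \int_0^{\varepsilon} \sqrt{\ln N(\mathcal F_i, \delta)}\; d\delta \le \ln N(\mathcal F, \varepsilon) + 24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal F, \delta)}\; d\delta,$$

which concludes the proof. □ 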


Theorem 9.8 requires the existence of finite coverings of $\mathcal F$ in the metric $d$. But since 
the definition of $d$ involves the logarithm of the experts' predictions, this is only possible 
if all $f_t(j \mid y^{t-1})$ are bounded away from 0. If the class of experts $\mathcal F$ does not satisfy this 
property, one may appeal to the following simple property. For simplicity, we state the 
lemma for the binary-alphabet case $m = 2$. Its extension to $m > 2$ is obvious. 


Lemma 9.5. Let $m = 2$, and let $\mathcal F$ be a class of experts. Define the class $\mathcal F^{(\delta)}$ as the set of 
all experts $f^{(\delta)}$ of the form 

$$f^{(\delta)}_t(1 \mid y^{t-1}) = \tau_\delta\bigl( f_t(1 \mid y^{t-1}) \bigr),$$

where 

$$\tau_\delta(x) = \begin{cases} \delta & \text{if } x < \delta \\ x & \text{if } x \in [\delta, 1 - \delta] \\ 1 - \delta & \text{if } x > 1 - \delta \end{cases}$$

for some fixed $0 < \delta < 1/2$. Then $V_n(\mathcal F) \le V_n(\mathcal F^{(\delta)}) + 2n\delta$. 


Thus, to obtain bounds for $V_n(\mathcal F)$, we may first calculate a bound for the truncated class 
$\mathcal F^{(\delta)}$ using Theorem 9.8 and then choose $\delta$ to optimize the right-hand side of the inequality 
of Lemma 9.5. The bound of the lemma is convenient but quite crude, and the resulting 
bounds are not always optimal. 




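Proof. Simply observe that for any sequence $y^n$, and any $f \in \mathcal F$, 

$$\ln f(y^n) = \sum_{t=1}^n \ln f_t(y_t \mid y^{t-1})$$

$$\le \sum_{t :\, f^{(\delta)}_t(y_t \mid y^{t-1}) < 1 - \delta} \ln f^{(\delta)}_t(y_t \mid y^{t-1}) + \sum_{t :\, f^{(\delta)}_t(y_t \mid y^{t-1}) = 1 - \delta} \ln 1$$

$$\le \sum_{t :\, f^{(\delta)}_t(y_t \mid y^{t-1}) < 1 - \delta} \ln f^{(\delta)}_t(y_t \mid y^{t-1}) + \sum_{t :\, f^{(\delta)}_t(y_t \mid y^{t-1}) = 1 - \delta} \bigl( \ln(1 - \delta) + 2\delta \bigr)$$

(since $\ln 1 \le \ln(1 - \delta) + \delta/(1 - \delta)$ by concavity, and using $0 < \delta < 1/2$) 

$$\le \sum_{t=1}^n \bigl( \ln f^{(\delta)}_t(y_t \mid y^{t-1}) + 2\delta \bigr) = \ln f^{(\delta)}(y^n) + 2n\delta.$$

Thus, for any forecaster $\hat p$, 

$$\hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n) = -\ln \hat p_n(y^n) + \sup_{f \in \mathcal F} \ln f_n(y^n)$$

$$\le -\ln \hat p_n(y^n) + \sup_{f^{(\delta)} \in \mathcal F^{(\delta)}} \ln f^{(\delta)}_n(y^n) + 2n\delta$$

$$= \hat L(y^n) - \inf_{f^{(\delta)} \in \mathcal F^{(\delta)}} L_{f^{(\delta)}}(y^n) + 2n\delta. \qquad \square$$
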
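
The key inequality of this proof, $\ln f(y^n) \le \ln f^{(\delta)}(y^n) + 2n\delta$, is easy to test on examples. The sketch below assumes a static binary expert described by its list of probabilities for outcome 1, with outcomes encoded as 1 and 2; the helper names are illustrative.

```python
import math

def tau(x, delta):
    """Truncate x to the interval [delta, 1 - delta]."""
    return min(max(x, delta), 1.0 - delta)

def truncated(probs_of_one, delta):
    """The truncated expert: tau applied to every probability of outcome 1."""
    return [tau(p, delta) for p in probs_of_one]

def log_likelihood(probs_of_one, ys):
    """ln f(y^n) for a static binary expert; returns -inf if an outcome has probability 0."""
    total = 0.0
    for p, y in zip(probs_of_one, ys):
        prob = p if y == 1 else 1.0 - p
        if prob == 0.0:
            return -math.inf
        total += math.log(prob)
    return total

if __name__ == "__main__":
    probs, delta = [0.0, 0.01, 0.3, 0.999, 1.0, 0.5], 0.05
    ys = [2, 2, 1, 1, 1, 1]
    lhs = log_likelihood(probs, ys)
    rhs = log_likelihood(truncated(probs, delta), ys) + 2 * len(ys) * delta
    print(lhs, "<=", rhs)
```

Truncation costs at most $2\delta$ per round and, in exchange, keeps every prediction at least $\delta$ away from 0, which is what makes finite coverings in the metric $d$ possible.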


9.11 Further Examples 


Parametric Classes 
Consider first classes $\mathcal F$ such that there exist positive constants $k$ and $c$ satisfying, for 
all $\varepsilon > 0$, 

$$\ln N(\mathcal F, \varepsilon) \le k \ln \frac{c \sqrt{n}}{\varepsilon}.$$

This is the case for most “parametric” classes, that is, classes that can be parameterized by 
a bounded subset of $\mathbb R^k$ in some “smooth” way, provided that all experts' predictions are 
bounded away from 0. 


Corollary 9.1. Assume that the covering numbers of the class $\mathcal F$ satisfy the inequality 
above. Then 

$$V_n(\mathcal F) \le \frac{k}{2} \ln n + o(\ln n).$$


The main term $\frac{k}{2} \ln n$ is the same as the one we have seen in the case of the class of all 
constant and Markov experts. In those cases we could derive much sharper expressions for 
$V_n(\mathcal F)$ even without requiring that the experts' predictions be bounded away from 0. On 
the other hand, this corollary allows us to handle, in a simple way, much more complicated 
classes of experts. An example is provided in Exercise 9.18. 
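Proof of Corollary 9.1. Substituting the condition on covering numbers in the upper bound 
of Theorem 9.8, the first term of the expression is bounded by $\frac{k}{2} \ln n + k \ln c - k \ln \varepsilon$. Then 
the second term may be bounded as follows: 

$$24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal F, \delta)}\; d\delta \le 48\, c \sqrt{kn} \int_{a_\varepsilon}^{\infty} x^2 e^{-x^2}\, dx$$

(by substituting $x = \sqrt{\ln(c\sqrt{n}/\delta)}$ and writing $a_\varepsilon = \sqrt{\ln(c\sqrt{n}/\varepsilon)}$) 

$$= 48\, c \sqrt{kn} \left[ \frac{a_\varepsilon\, \varepsilon}{2 c \sqrt{n}} + \frac12 \int_{a_\varepsilon}^{\infty} e^{-x^2}\, dx \right]$$

(by integrating by parts) 

$$\le 48\, c \sqrt{kn} \left[ \frac{a_\varepsilon\, \varepsilon}{2 c \sqrt{n}} + \frac{\varepsilon}{4 a_\varepsilon c \sqrt{n}} \right]$$

(by using the gaussian tail estimate $\int_t^{\infty} e^{-x^2}\, dx \le e^{-t^2}/(2t)$) 

$$\le 48 \sqrt{k}\, a_\varepsilon\, \varepsilon \qquad \text{(whenever } e\varepsilon \le c\sqrt{n}\text{)}$$

$$\le 48 \sqrt{2}\, \varepsilon \sqrt{k \ln(c\sqrt{n})} \qquad \text{(whenever } \varepsilon \ge 1/(c\sqrt{n})\text{).}$$

The obtained upper bound is minimized for 

$$\varepsilon = \frac{1}{48\sqrt{2}} \sqrt{\frac{k}{\ln(c\sqrt{n})}}.$$

So, for every $n$ so large that 

$$c\sqrt{n} \ge 48\sqrt{2} \sqrt{\frac{\ln(c\sqrt{n})}{k}},$$

we have 

$$V_n(\mathcal F) \le \frac{k}{2} \ln n + \frac{k}{2} \ln \frac{\ln(c\sqrt{n})}{k} + k \ln c + 6k,$$

which proves the statement. □ 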
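
To get a feel for the constants, the bound of Theorem 9.8 can be evaluated under the covering-number condition. The sketch below relies on a closed form obtained by carrying the substitution of the proof one step further, $\int_0^{\varepsilon} \sqrt{\ln(a/x)}\,dx = \varepsilon u + a \frac{\sqrt{\pi}}{2} \operatorname{erfc}(u)$ with $u = \sqrt{\ln(a/\varepsilon)}$ and $a = c\sqrt{n}$. This closed form is our own calculation, not a formula from the text, and the function names are illustrative.

```python
import math

def entropy_integral(eps, a):
    """Integral of sqrt(ln(a / x)) over x in (0, eps], assuming 0 < eps <= a."""
    u = math.sqrt(math.log(a / eps))
    return eps * u + a * (math.sqrt(math.pi) / 2.0) * math.erfc(u)

def theorem_9_8_bound(n, k, c, eps):
    """ln N + 24 * entropy integral, with ln N(F, x) replaced by k ln(c sqrt(n) / x)."""
    a = c * math.sqrt(n)
    return k * math.log(a / eps) + 24.0 * math.sqrt(k) * entropy_integral(eps, a)

def corollary_eps(n, k, c):
    """The radius chosen in the proof of Corollary 9.1."""
    return math.sqrt(k / math.log(c * math.sqrt(n))) / (48.0 * math.sqrt(2.0))

def corollary_bound(n, k, c):
    """(k/2) ln n + (k/2) ln(ln(c sqrt n) / k) + k ln c + 6k."""
    log_a = math.log(c * math.sqrt(n))
    return 0.5 * k * math.log(n) + 0.5 * k * math.log(log_a / k) + k * math.log(c) + 6 * k

if __name__ == "__main__":
    n, k, c = 10 ** 6, 2, 1.0
    eps = corollary_eps(n, k, c)
    print(eps, theorem_9_8_bound(n, k, c, eps), corollary_bound(n, k, c))
```

For moderate $n$ the additive term $6k$ is comparable to the leading term $\frac{k}{2}\ln n$, which shows how slowly the $o(\ln n)$ remainder becomes negligible.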


Nonparametric Classes 

Theorem 9.8 may also be used to handle much larger, “nonparametric” classes. In such 
cases the minimax regret $V_n(\mathcal F)$ may be of a significantly larger order of magnitude than the 
logarithmic bounds characteristic of parametric classes. We work out one simple example 
here. 

Let $\mathcal Y = \{1, 2\}$ be a binary alphabet, and consider the class $\mathcal F^{(\delta)}$ of all experts $f$ such that 
$f_t(1 \mid y^{t-1}) = f_t(1) \in [\delta, 1 - \delta]$, where $\delta \in (0, 1/2)$ is some fixed constant, and for each 
$t = 2, 3, \ldots, n$, $f_t(1) \ge f_{t-1}(1)$. (The case when $\delta = 0$ may be treated by Lemma 9.5.) 
In other words, $\mathcal F^{(\delta)}$ contains all static experts that assign a probability to outcome 1 in a 
monotonically increasing manner. To estimate the covering numbers of $\mathcal F^{(\delta)}$, consider the 
finite subclass $\mathcal G$ of $\mathcal F^{(\delta)}$ containing only those monotone experts $g$ that take values of the 
form $g_t(1) = \delta + (i/k)(1 - 2\delta)$, $i = 0, \ldots, k$, where $k$ is a positive integer to be specified 
later. It is easy to see that $|\mathcal G| = \binom{n+k}{k} \le (2n)^k$ if $k \le n$, and $|\mathcal G| \le 2^{2k}$ otherwise. On the 
other hand, for any $f \in \mathcal F^{(\delta)}$, if $g$ is the expert in $\mathcal G$ closest to $f$, then for each $t \le n$, 

$$\max_{y \in \{1,2\}} \bigl| \ln f_t(y) - \ln g_t(y) \bigr| \le \frac{1}{\delta} \max_{y \in \{1,2\}} \bigl| f_t(y) - g_t(y) \bigr| \le \frac{1}{\delta k}(1 - 2\delta) \le \frac{1}{\delta k}.$$

Thus, $d(f, g) \le \sqrt{n}/(\delta k)$. By taking $k = \sqrt{n}/(\delta\varepsilon)$, it follows that the covering number of 
$\mathcal F^{(\delta)}$ is bounded as 

$$N(\mathcal F^{(\delta)}, \varepsilon) \le \begin{cases} (2n)^{\sqrt{n}/(\delta\varepsilon)} & \text{if } \varepsilon \ge \frac{1}{\delta\sqrt{n}} \\ 2^{2\sqrt{n}/(\delta\varepsilon)} & \text{otherwise.} \end{cases}$$

Substituting this bound into Theorem 9.8, it is a matter of straightforward calculation to 
obtain 

$$V_n(\mathcal F^{(\delta)}) = O\bigl( n^{1/3}\, \delta^{-2/3} \ln^{2/3} n \bigr).$$

Note that the radius optimizing the bound of Theorem 9.8 is $\varepsilon \approx n^{1/6} \delta^{-1/3} \ln^{1/3} n$. Finally, 
if $\mathcal F$ is the class of all monotonically predicting experts, without restricting 
predictions in $[\delta, 1 - \delta]$, then by Lemma 9.5, $V_n(\mathcal F) \le V_n(\mathcal F^{(\delta)}) + 2n\delta$. By optimizing the 
value of $\delta$ in the upper bound obtained, we get $V_n(\mathcal F) = O\bigl( n^{3/5} \ln^{2/5} n \bigr)$. 
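
The covering argument can be illustrated by rounding a monotone expert down to the grid and measuring the distance in the metric $d$. This is a minimal sketch assuming static binary experts given by their probabilities of outcome 1; rounding down (rather than to the nearest grid point) is our choice to keep the rounded expert monotone, and the function names are hypothetical.

```python
import math

def round_down_to_grid(f, delta, k):
    """Round each f_t(1) in [delta, 1 - delta] down to the grid delta + (i/k)(1 - 2 delta)."""
    step = (1.0 - 2.0 * delta) / k
    g = []
    for p in f:
        i = min(k, max(0, math.floor((p - delta) / step)))
        g.append(delta + i * step)
    return g

def metric_d(f, g):
    """d(f, g) for static binary experts described by their probabilities of outcome 1."""
    total = 0.0
    for p, q in zip(f, g):
        worst = max(abs(math.log(p / q)), abs(math.log((1.0 - p) / (1.0 - q))))
        total += worst ** 2
    return math.sqrt(total)

if __name__ == "__main__":
    n, delta, k = 50, 0.1, 20
    f = [delta + (1 - 2 * delta) * (t / (n - 1)) ** 2 for t in range(n)]  # a monotone expert
    g = round_down_to_grid(f, delta, k)
    print(metric_d(f, g), "<=", math.sqrt(n) / (delta * k))
```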


9.12 Bibliographic Remarks 


The literature about predicting “individual sequences” under the logarithmic loss function 
has been intimately tied with a closely related “probabilistic” setup in which one assumes 
that the sequence of outcomes is generated randomly by one of the distributions in the 
class of experts. Even though in this chapter we do not consider the probabilistic setup 
at all, often one cannot separate the literature on the two problems, sometimes commonly 
known as the problem of universal prediction. The related literature is huge, and here we 
only mention a small selection of references. A survey summarizing a large body of the 
literature on prediction under the logarithmic loss is offered by Merhav and Feder [214]. 
The tight connection of sequential probability assignment and universal (lossless) source 
coding goes back to Kolmogorov [185] and Solomonoff [274, 275]. Fitingof [99, 100] and 
Davisson [79] were also among the pioneers of the field. 

The connection of sequential probability assignment and data compression with arith- 
metic coding was first revealed by Rissanen [236] and Rissanen and Langdon [243]. One 
of the most successful sequential coding methods, also applicable to prediction, is the 
Lempel—Ziv algorithm (see [197,319] and also Feder, Merhav, and Gutman [95]). 

The equivalence of sequential gambling and forecasting under the logarithmic loss 
function was noted by Kelly [180]; see also Cover [69] and Feder [94]. 

De Santis, Markowski, and Wegman [260] consider the logarithmic loss in the context 
of online learning. Theorem 9.1 is due to Shtarkov [267] just like Theorem 9.2; see also 
Freund [111], Xie and Barron [312]. The Laplace mixture forecaster was introduced, in 
the context of universal coding, by Davisson [79], and also investigated by Rissanen [239]. 
The refined mixture forecaster presented in Section 9.7 was suggested by Krichevsky and 
Trofimov [186]. Lemma 9.3 appears in Willems, Shtarkov, and Tjalkens [311]. It is shown 
in Xie and Barron [312] and Freund [111] that the Krichevsky–Trofimov mixture, in fact, 
achieves a regret $\frac12 \ln n + \frac12 \ln\frac{\pi}{2} + o(1)$. This is optimal even in the additive constant for all 
sequences except for those containing very few 1's or 2's. Xie and Barron [312] refine the 
mixture further so that it achieves a worst-case cumulative regret of $\frac12 \ln n + \frac12 \ln\frac{\pi}{2} + o(1)$, 
matching the performance of the minimax optimal forecaster. Xie and Barron [312] also 
derive the analog of all these results in the general case $m > 2$. Theorem 9.5 also appears 
in [312], where the case of $m$-ary alphabet is also treated and the asymptotic constant is 
determined. Szpankowski [282] develops analytical tools to determine $V_n(\mathcal F)$ to arbitrary 
precision for the class of constant experts; see also Drmota and Szpankowski [90]. 

The material of Section 9.8 is based on the work of Weinberger, Merhav, and Feder [307], 
who prove a similar result in a significantly more general setup. In [307] an analogous lower 
bound is shown for classes of experts defined by a finite-state machine with a strongly 
connected state transition graph. The work of Weinberger, Merhav, and Feder was inspired 
by a similar result of Rissanen [238] in the model of probabilistic prediction. 

Lower bounds for the minimax regret under general metric entropy assumptions may be 
obtained by noting that lower bounds for the probabilistic counterpart work in the setup of 
individual sequences as well. We mention the important work of Haussler and Opper [152]. 

Theorem 9.8 is due to Cesa-Bianchi and Lugosi [53], who improve an earlier result 
of Opper and Haussler [228] for classes of static experts. A general expression for the 
minimax regret, not described in this chapter, for certain regular parametric classes has 
been derived by Rissanen [242]. More specifically, Rissanen considers classes $\mathcal F$ of experts 
$f_{n,\theta}$ parameterized by an open and bounded set of parameters $\Theta \subset \mathbb R^k$. It is shown in [242] 
that under certain regularity assumptions, 

$$V_n(\mathcal F) = \frac{k}{2} \ln\frac{n}{2\pi} + \ln \int_\Theta \sqrt{\det(I(\theta))}\; d\theta + o(1),$$


where the $k \times k$ matrix $I(\theta)$ is the so-called Fisher information matrix, whose entry in 
position $(i, j)$ is defined by 


ny O° In fao”) 
-p had ) 5508, 


where $\theta_i$ is the $i$th component of vector $\theta$. Yamanishi [313] generalizes Rissanen’s results 
to a wider class of loss functions. The expressions of the minimax regret for the class of 
Markov experts were determined by Rissanen [242] and Jacquet and Szpankowski [168]. 

Finally, we mention that the problem of prediction under the logarithmic loss has 
applications in the study of the general principle of minimum description length (MDL), first 
proposed by Rissanen [237, 238, 241]. For quite exhaustive surveys see Barron, Rissanen, 
Yu [22], Griinwald [134], and Hansen and Yu [143]. 


9.13 Exercises 


9.1 Let $\mathcal F$ be the class of all experts such that for each $f \in \mathcal F$, $f_t(j \mid y^{t-1}) = f(j)$ (with $f(j) \ge 0$, 
$\sum_{j=1}^m f(j) = 1$) independently of $t$ and $y^{t-1}$. For a particular sequence $y^n \in \mathcal Y^n$, determine the 
best expert and its cumulative loss. 


9.2 Assume that you want to bet in the horse race only once and that you know that the $j$th horse 
wins with probability $p_j$ and the odds are $o_1, \ldots, o_m$. How would you distribute your money to 
maximize your expected winnings? Contrast your result with the setup described in Section 9.3, 
where the optimal betting strategy is independent of the odds. 

9.3 Show that there exists a class $\mathcal F$ of experts with cardinality $|\mathcal F| = N$ such that for all $n \ge \log_2 N$, 
$V_n(\mathcal F) = \ln N$. This exercise shows that the bound $\ln N$ achieved by the uniform mixture 
forecaster is not improvable for some classes. 

9.4 Consider class $\mathcal F$ of all constant experts. Show that the normalized maximum likelihood forecaster 
$p^*_n$ is horizon dependent in the sense that if $p^*_t$ denotes the normalized maximum likelihood 
forecaster for some $t < n$ (i.e., $p^*_t$ achieves the minimax regret $V_t(\mathcal F)$), then it is not true that 

$$\sum_{y^n_{t+1} \in \mathcal Y^{n-t}} p^*_n(y^n) = p^*_t(y^t).$$

9.5 Let $\mathcal F$ be a class of experts and let $q$ and $\hat p$ be arbitrary forecasters (i.e., probability distributions 
over $\mathcal Y^n$). Show that 

$$\sum_{y^n \in \mathcal Y^n} q(y^n) \ln \frac{\sup_{f \in \mathcal F} f_n(y^n)}{\hat p(y^n)} \ge \sum_{y^n \in \mathcal Y^n} q(y^n) \ln \frac{\sup_{f \in \mathcal F} f_n(y^n)}{q(y^n)}$$

and that 

$$\sum_{y^n \in \mathcal Y^n} q(y^n) \ln \frac{\sup_{f \in \mathcal F} f_n(y^n)}{q(y^n)} = V_n(\mathcal F) - D(q \,\|\, p^*_n),$$

where $p^*$ is the normalized maximum likelihood forecaster and $D$ denotes Kullback–Leibler 
divergence. 

9.6 Show that 

$$\int_0^1 \frac{1}{\sqrt{x(1-x)}}\; dx = \pi.$$

Hint: Substitute $x$ by $\sin^2 \alpha$. 

9.7 Show that for every $n > 1$ there exists a class $\mathcal F_n$ of two static experts such that if $\hat p$ denotes the 
exponentially weighted average (or mixture) forecaster, then 

$$\frac{V_n(\hat p, \mathcal F_n)}{V_n(\mathcal F_n)} \ge c\sqrt{n}$$

for some universal constant $c$ (see Cesa-Bianchi and Lugosi [53]). Hint: Let $\mathcal Y = \{0, 1\}$ and let 
$\mathcal F_n$ contain the two experts $f, g$ defined by $f(1 \mid y^{t-1}) = \frac12$ and $g(1 \mid y^{t-1}) = \frac12 + \frac{1}{2n}$. Show, on 
the one hand, that $V_n(\mathcal F_n) \le c_1 n^{-1/2}$ and on the other hand that $V_n(\hat p, \mathcal F_n) \ge c_2$ for appropriate 
constants $c_1, c_2$. 

9.8 Extend Theorem 9.2 to the case when the outcome space is $\mathcal Y = \{1, \ldots, m\}$. More precisely, 
show that the minimax regret of the class of constant experts is 

$$V_n(\mathcal F) = \frac{m-1}{2} \ln\frac{n}{2\pi} + \ln\frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1) = \frac{m-1}{2} \ln\frac{ne}{m} + o(1)$$

(Xie and Barron [312]). 

9.9 Complete the proof of Theorem 9.2 by showing that $V_n(\mathcal F) \ge \frac12 \ln n + \frac12 \ln\frac{\pi}{2} + o(1)$. Hint: The 
proof goes the same way as that of the upper bound, but to get the right constant you need to be 
a bit careful when $n_1/n$ or $n_2/n$ is small. 

9.10 Define the Laplace forecaster for the class of all constant experts over the alphabet $\mathcal Y = 
\{1, 2, \ldots, m\}$ for $m > 2$. Extend the arguments of Section 9.6 to this general case. In particular, 
show that the worst-case cumulative regret of the uniformly weighted mixture forecaster 
satisfies 

$$\sup_{y^n \in \mathcal Y^n} \Bigl( \hat L(y^n) - \inf_{f \in \mathcal F} L_f(y^n) \Bigr) = \ln\binom{n + m - 1}{m - 1} \le (m - 1) \ln(n + 1).$$

9.11 Prove Lemma 9.3. Hint: Show that the ratio of the two sides decreases if we replace $n$ by $n + 1$ 
(and also increase either $n_1$ or $n_2$ by 1) and thus achieves its minimum when $n = 1$. Warning: 
This requires some work. 

9.12 Show that the function defined by (9.1) is decreasing in both of its variables. Hint: Proceed by 
induction. 

9.13 Show that the minimax regret $V_n(\mathcal M_k)$ of the class of all $k$th-order Markov experts over a binary 
alphabet $\mathcal Y = \{1, 2\}$ satisfies 

$$V_n(\mathcal M_k) = \frac{2^k}{2} \ln\frac{n}{2^k} + O(1)$$

(Rissanen [242]). 

9.14 Generalize Theorem 9.6 to the case when $\mathcal Y = \{1, \ldots, m\}$, with $m > 2$, and $\mathcal F$ is the class of 
all constant experts. 

9.15 Generalize Theorem 9.6 to the case when $\mathcal Y = \{1, 2\}$ and $\mathcal F = \mathcal M_k$ is the class of all $k$th-order 
Markov experts. Hint: You may need to redefine classes $T_j$ as classes of “Markov types” 
adequately. Counting the cardinality of these classes is not as trivial as in the case of constant 
experts. (See Weinberger, Merhav, and Feder [307] for a more general result.) 

9.16 (Double mixture for Markov experts) Assume that $\mathcal Y = \{1, 2\}$. Construct a forecaster $\hat p$ such 
that for any $k = 1, 2, \ldots$ and any $k$th-order Markov expert $f \in \mathcal M_k$, 

$$\sup_{y^n} \bigl( \hat L(y^n) - L_f(y^n) \bigr) \le \frac{2^k}{2} \ln\frac{n}{2^k} + \alpha_{k,n},$$

where for each $k$, $\limsup_{n \to \infty} \alpha_{k,n} \le \beta_k < \infty$. Hint: For each $k$ consider the forecaster described 
in Example 9.1. Then combine them by the countable mixture described in Section 9.2 (see also 
Ryabko [252, 253]). 

9.17 (Predicting as well as the best finite-state machine) Consider $\mathcal Y = \{1, 2\}$. A $k$-state finite-state 
machine forecaster is defined as a triple $(S, F, G)$ where $S$ is a finite set of $k$ elements, 
$F : S \to [0, 1]$ is the output function, and $G : \mathcal Y \times S \to S$ is the next-state function. For a 
sequence of outcomes $y_1, y_2, \ldots$, the finite-state forecaster produces a prediction given by the 
recursions 

$$s_t = G(s_{t-1}, y_{t-1}) \quad \text{and} \quad f_t(1 \mid y^{t-1}) = F(s_t)$$

for $t = 2, 3, \ldots$ while $f_1(1) = F(s_1)$ for an initial state $s_1 \in S$. Construct a forecaster that 
predicts almost as well as any finite-state machine forecaster in the sense that 

$$\sup_{y^n \in \mathcal Y^n} \bigl( \hat L(y^n) - L_f(y^n) \bigr) = O(\ln n)$$

for every finite-state machine forecaster $f$. Hint: Use the previous exercise. (See also Feder, 
Merhav, and Gutman [95].) 




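9.18 (Experts with fading memory) Let $\mathcal{Y} = \{0, 1\}$ and consider the one-parameter class $\mathcal{F}$ of
distributions on $\{0, 1\}^n$ containing all experts $f^{(a)}$, with $a \in [0, 1]$, where each $f^{(a)}$ is
defined by its conditionals as $f_1^{(a)}(1) = 1/2$, $f_2^{(a)}(1 \mid y_1) = y_1$, and

$$
f_t^{(a)}(1 \mid y^{t-1}) = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s \left( 1 + \frac{a(2s - t)}{t - 2} \right)
$$

for all $y^{t-1} \in \{0, 1\}^{t-1}$ and for all $t > 2$. Show that

$$
V_n(\mathcal{F}) = O(\ln n).
$$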


Hint: First show using Theorem 9.8 that V,(F) < 1 lnn + 5 InIn va + In 1 + O(1) and then 
use Lemma 9.5.


Sequential Investment


10.1 Portfolio Selection 


This chapter is devoted to the application of the ideas described in Chapter 9 to the problem 
of sequential investment. Imagine a market of m assets (stocks) in which, in each trading 
period (day), the price of a stock may vary in an arbitrary way. An investor operates on this 
market for n days with the goal of maximizing his final wealth. At the beginning of each 
day, on the basis of the past behavior of the market, the investor redistributes his current 
wealth among the m assets. Following the approach developed in the previous chapters, 
we avoid any statistical assumptions about the nature of the stock market, and evaluate 
the investor’s wealth relative to the performance achieved by the best strategy in a class of 
reference investment strategies (the “experts”).

In the idealized stock market we assume that there are no transaction costs and the 
amount of each stock that can be bought at any trading period is only limited by the 
investor’s wealth at that time. Similarly, the investor can sell any quantity of the stocks he 
possesses at any time at the actual market price. 

The model may be formalized as follows. A market vector $\mathbf{x} = (x_1, \ldots, x_m)$ for $m$ assets
is a vector of nonnegative real numbers representing price relatives for a given trading
period. In other words, the quantity $x_i \ge 0$ denotes the ratio of closing to opening price
of the $i$th asset for that period. Hence, an initial wealth invested in the $m$ assets according
to fractions $Q_1, \ldots, Q_m$ multiplies by a factor of $\sum_{i=1}^m x_i Q_i$ at the end of the period. The
market behavior during $n$ trading periods is represented by a sequence of market vectors
$x^n = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$. The $j$th component of $\mathbf{x}_t$, denoted by $x_{j,t}$, is the factor by which the
wealth invested in asset $j$ increases in the $t$th period.

As in Chapter 9, we denote the probability simplex in $\mathbb{R}^m$ by $\mathcal{D}$. An investment strategy $Q$
for $n$ trading periods is a sequence $Q_1, \ldots, Q_n$ of vector-valued functions
$Q_t : (\mathbb{R}_+^m)^{t-1} \to \mathcal{D}$, where the $i$th component $Q_{i,t}(x^{t-1})$ of the vector $Q_t(x^{t-1})$ denotes the
fraction of the current wealth invested in the $i$th asset at the beginning of the $t$th period on
the basis of the past market behavior $x^{t-1}$. We use

$$
S_n(Q, x^n) = \prod_{t=1}^n \left( \sum_{i=1}^m x_{i,t} \, Q_{i,t}(x^{t-1}) \right)
$$

to denote the wealth factor of strategy $Q$ after $n$ trading periods. The fact that $Q_t$ has
nonnegative components summing to 1 expresses the condition that short sales and buying
on margin are excluded.




Example 10.1 (Buy-and-hold strategies). The simplest investment strategies are the so-called
buy-and-hold strategies. An investor following such a strategy simply distributes
his initial wealth among $m$ assets according to some distribution $Q_1 \in \mathcal{D}$ before the first
trading period and does not trade anymore. The wealth factor of such a strategy, after $n$
periods, is simply

$$
S_n(Q, x^n) = \sum_{j=1}^m Q_{j,1} \prod_{t=1}^n x_{j,t}.
$$

Clearly, this wealth factor is at most as large as the gain $\max_{j=1,\ldots,m} \prod_{t=1}^n x_{j,t}$ of the best
stock over the same investment period and achieves this maximal wealth if $Q_1$ concentrates
on this best stock.


Example 10.2 (Constantly rebalanced portfolios). Another simple and important class of
investment strategies is the class of constantly rebalanced portfolios. Such a strategy $B$
is parameterized by a probability vector $B = (B_1, \ldots, B_m) \in \mathcal{D}$ and simply sets
$Q_t(x^{t-1}) = B$ regardless of $t$ and the past market behavior $x^{t-1}$. Thus, an investor following
such a strategy rebalances, at every trading period, his current wealth according to the
distribution $B$ by investing a proportion $B_1$ of this wealth in the first stock, a proportion $B_2$
in the second stock, and so on. Observe that, as opposed to buy-and-hold strategies, an investor
using a constantly rebalanced portfolio $B$ is engaged in active trading in each period. The wealth
factor achieved after $n$ trading periods is

$$
S_n(B, x^n) = \prod_{t=1}^n \left( \sum_{i=1}^m x_{i,t} B_i \right).
$$

To understand the power of constantly rebalanced strategies, consider a simple market of
$m = 2$ stocks such that the sequence of market vectors is $(1, \tfrac{1}{2}), (1, 2), (1, \tfrac{1}{2}), (1, 2), \ldots$
Thus, the first stock maintains its value stable while the second stock is more volatile: on
even days it doubles its price, whereas on odd days it loses half of its value. Clearly, in the
long run, neither of the two stocks (and therefore no buy-and-hold strategy) yields any gain.
On the other hand, the investment strategy that rebalances every day uniformly (i.e., with
$B = (\tfrac{1}{2}, \tfrac{1}{2})$) achieves an exponentially increasing wealth at a rate $(9/8)^{n/2}$, because every
pair of consecutive days multiplies the wealth by $\tfrac{3}{4} \cdot \tfrac{3}{2} = \tfrac{9}{8}$. The importance
of constantly rebalanced portfolios is largely due to the fact that if the market vectors $\mathbf{x}_t$
are realizations of an i.i.d. process and the number $n$ of investment periods is large, then
the best possible investment strategy, in a quite strong sense, is a constantly rebalanced
portfolio (see Cover and Thomas [74] for a nice summary).
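
The two-stock market of Example 10.2 is easy to simulate. The following is a minimal sketch,
not part of the original text: the function names `wealth_crp` and `wealth_buy_and_hold` and
the horizon `n = 20` are illustrative choices. It computes the wealth factors of the uniform
constantly rebalanced portfolio and of the uniform buy-and-hold strategy on the alternating
market $(1, 1/2), (1, 2), \ldots$

```python
def wealth_crp(b, market):
    """Wealth factor of the constantly rebalanced portfolio b on the given market."""
    s = 1.0
    for x in market:
        # Each period multiplies the current wealth by the portfolio return b . x_t.
        s *= sum(bi * xi for bi, xi in zip(b, x))
    return s


def wealth_buy_and_hold(q, market):
    """Wealth factor of a buy-and-hold strategy with initial allocation q."""
    growth = [1.0] * len(q)
    for x in market:
        # Each asset's share grows on its own; no wealth is moved between assets.
        growth = [g * xi for g, xi in zip(growth, x)]
    return sum(qi * gi for qi, gi in zip(q, growth))


n = 20  # an even number of trading days (illustrative)
# Odd days: the second stock halves; even days: it doubles.
market = [(1.0, 0.5) if t % 2 == 1 else (1.0, 2.0) for t in range(1, n + 1)]

crp = wealth_crp((0.5, 0.5), market)
bah = wealth_buy_and_hold((0.5, 0.5), market)
print(crp, bah)
```

Under these assumptions the rebalanced wealth equals $(9/8)^{n/2}$, while buy-and-hold ends
each even day exactly where it started.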


As for the other models of prediction considered in this book, the performance of any 
investment strategy is measured by comparing it to the best in a fixed class of strategies. 
To formalize this notion, we introduce the worst-case logarithmic wealth ratio in the next 
section. In Section 10.3 the main result of this chapter is presented, which points out a 
certain equivalence between the problem of sequential investment and prediction under the 
logarithmic loss studied in Chapter 9. This equivalence permits one to determine the limits 
of any investment strategy as well as to design strategies with near optimal performance 
guarantees. In particular, in Section 10.4 a strategy called “universal portfolio” is introduced 
that is shown to be an analog of the mixture forecasters of Chapter 9. The so-called EG
investment strategy, whose aim is to relieve the computational burden of the universal
portfolio, is presented in Section 10.5. In Section 10.6 we allow the investor to take certain side
information into account and develop investment strategies in this extended framework.


10.2 The Minimax Wealth Ratio 


The investor’s objective is to achieve a wealth comparable to the best of a certain class of
investment strategies regardless of the market behavior. Thus, given a class $\mathcal{Q}$ of investment
strategies, we define the worst-case logarithmic wealth ratio of strategy $P$ by

$$
W_n(P, \mathcal{Q}) = \sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n)}{S_n(P, x^n)}.
$$

Clearly, the investor’s goal is to choose a strategy $P$ for which $W_n(P, \mathcal{Q})$ is as small
as possible. $W_n(P, \mathcal{Q}) = o(n)$ means that the investment strategy $P$ achieves the same
exponent of growth as the best reference strategy in class $\mathcal{Q}$ for all possible market behaviors.
The minimax logarithmic wealth ratio is just the best possible worst-case logarithmic wealth
ratio achievable by any investment strategy $P$:

$$
W_n(\mathcal{Q}) = \inf_{P} W_n(P, \mathcal{Q}).
$$


Example 10.3 (Finite classes). Assume that the investor competes against a finite class
$\mathcal{Q} = \{Q^{(1)}, \ldots, Q^{(N)}\}$ of investment strategies. A very simple strategy $P$ divides the initial
wealth in $N$ equal parts and invests each part according to one of the “experts” $Q^{(i)}$. Then the
total wealth of the strategy is

$$
S_n(P, x^n) = \frac{1}{N} \sum_{i=1}^N S_n(Q^{(i)}, x^n)
$$

and the worst-case logarithmic wealth ratio is bounded as

$$
W_n(P, \mathcal{Q})
  = \sup_{x^n} \ln \frac{\max_{i=1,\ldots,N} S_n(Q^{(i)}, x^n)}{\frac{1}{N} \sum_{j=1}^N S_n(Q^{(j)}, x^n)}
  \le \sup_{x^n} \ln \frac{\max_{i=1,\ldots,N} S_n(Q^{(i)}, x^n)}{\frac{1}{N} \max_{j=1,\ldots,N} S_n(Q^{(j)}, x^n)}
  = \ln N. \qquad \square
$$
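
The bound of Example 10.3 can be checked numerically. The sketch below is illustrative only:
the three "experts", the random market, and the helper name `wealth` are assumptions made for
the demonstration and do not come from the text. It splits the initial wealth equally among
the experts and compares the result with the best expert.

```python
import math
import random


def wealth(strategy, market):
    """Wealth factor of a strategy given as a function: past market -> portfolio."""
    s = 1.0
    for t, x in enumerate(market):
        q = strategy(market[:t])
        s *= sum(qi * xi for qi, xi in zip(q, x))
    return s


# Three hypothetical experts over m = 2 assets (all static, for simplicity).
experts = [
    lambda past: (1.0, 0.0),   # everything in asset 1
    lambda past: (0.0, 1.0),   # everything in asset 2
    lambda past: (0.5, 0.5),   # uniform constantly rebalanced portfolio
]

rng = random.Random(0)
market = [(rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)) for _ in range(50)]

expert_wealth = [wealth(q, market) for q in experts]
# Dividing the initial capital into N equal parts yields the average wealth.
mixture_wealth = sum(expert_wealth) / len(experts)
log_ratio = math.log(max(expert_wealth) / mixture_wealth)
print(log_ratio, math.log(len(experts)))
```

Whatever market sequence is used, `log_ratio` cannot exceed $\ln N$, since the average of $N$
nonnegative numbers is at least $1/N$ times their maximum.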


10.3 Prediction and Investment 


In this section we point out an intimate connection between the sequential investment
problem and the problem of prediction under the logarithmic loss studied in Chapter 9.
The first thing we observe is that any investment strategy $Q$ over $m$ assets may be used to
define a forecaster that predicts the elements $y_t \in \mathcal{Y} = \{1, \ldots, m\}$ of a sequence $y^n \in \mathcal{Y}^n$
with probability vectors $\widehat{p}_t \in \mathcal{D}$. To do this, we simply restrict our attention to those market
vectors $\mathbf{x}$ that have a single component that is equal to 1 and all other components equal to
0. Such vectors are called Kelly market vectors. Observe that a Kelly market is just like the
horse race described in Section 9.3, with the only restriction that all odds are supposed to
be equal to 1. If $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are Kelly market vectors and we denote the index of the only
nonzero component of each vector $\mathbf{x}_t$ by $y_t$, then we may define forecaster $f$ by

$$
f_t(y \mid y^{t-1}) = Q_{y,t}(x^{t-1}).
$$

We say that the forecaster $f$ is induced by the investment strategy $Q$. With some abuse of
notation we write $S_n(Q, y^n)$ for $S_n(Q, x^n)$ when $x^n$ is a sequence of Kelly vectors determined
by the sequence $y^n$ of indices. Clearly, $S_n(Q, y^n) = f_n(y^n)$ if $f$ is the forecaster induced
by $Q$.
To relate the regret in forecasting to the logarithmic wealth ratio, we use the logarithmic
loss $\ell(\widehat{p}_t, y_t) = -\ln \widehat{p}_t(y_t \mid y^{t-1})$. Then the regret against a reference forecaster $f$ is

$$
\widehat{L}_n - L_{f,n} = \ln \frac{f_n(y^n)}{\widehat{p}_n(y^n)} = \ln \frac{S_n(Q, y^n)}{S_n(P, y^n)},
$$

where $Q$ and $P$ are the investment strategies induced by $f$ and $\widehat{p}$ (see Section 9.1).

Now it is obvious that the investment problem is at least as difficult as the corresponding
prediction problem.


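Lemma 10.1. Let $\mathcal{Q}$ be a class of investment strategies, and let $\mathcal{F}$ denote the class of
forecasters induced by the strategies in $\mathcal{Q}$. Then the minimax regret

$$
V_n(\mathcal{F}) = \inf_{\widehat{p}_n} \sup_{y^n} \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{\widehat{p}_n(y^n)}
$$

satisfies $W_n(\mathcal{Q}) \ge V_n(\mathcal{F})$.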


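Proof. Let $P$ be any investment strategy and let $\widehat{p}$ be its induced forecaster. Then

$$
\begin{aligned}
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n)}{S_n(P, x^n)}
  &\ge \max_{y^n \in \mathcal{Y}^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, y^n)}{S_n(P, y^n)} \\
  &= \max_{y^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{\widehat{p}_n(y^n)} \\
  &= V_n(\widehat{p}, \mathcal{F}) \ge V_n(\mathcal{F}). \qquad \blacksquare
\end{aligned}
$$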


Surprisingly, as it turns out, the investment problem is not genuinely more difficult than that 
of prediction. In what follows, we show that in many interesting cases the two problems 
are, in fact, equivalent in a minimax sense. 

Given a prediction strategy $\widehat{p}$, we define an investment strategy $P$ as follows:

$$
P_{j,t}(x^{t-1}) =
\frac{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \widehat{p}_t(j \mid y^{t-1}) \, \widehat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}
     {\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \widehat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}.
$$

Note that the factors $\prod_{s=1}^{t-1} x_{y_s,s}$ may be viewed as the return of the “extremal” investment
strategy that, on each trading period $s$, invests everything on the $y_s$th asset. Clearly, the
obtained investment strategy induces $\widehat{p}$, and so we will say that $\widehat{p}$ and $P$ induce each other.

The following result is the key in relating the minimax wealth ratio to the minimax regret
of forecasters.


Theorem 10.1. Let $P$ be an investment strategy induced by a forecaster $\widehat{p}$, and let $\mathcal{Q}$ be an
arbitrary class of investment strategies. Then for any market sequence $x^n$,

$$
\sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n)}{S_n(P, x^n)}
  \le \max_{y^n \in \mathcal{Y}^n} \sup_{Q \in \mathcal{Q}} \ln \frac{\prod_{t=1}^n Q_{y_t,t}(x^{t-1})}{\widehat{p}_n(y^n)}.
$$


The proof of the theorem uses the following two simple lemmas. The first is an elementary
inequality, whose proof is left as an exercise.


Lemma 10.2. Let $a_1, \ldots, a_n, b_1, \ldots, b_n$ be nonnegative numbers. Then

$$
\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i} \le \max_{j=1,\ldots,n} \frac{a_j}{b_j},
$$

where we define $0/0 = 0$.


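Lemma 10.3. The wealth factor achieved by an investment strategy $Q$ may be written as

$$
S_n(Q, x^n) = \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^n x_{y_t,t} \right) \left( \prod_{t=1}^n Q_{y_t,t}(x^{t-1}) \right).
$$

If the investment strategy $P$ is induced by a forecaster $\widehat{p}_n$, then

$$
S_n(P, x^n) = \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^n x_{y_t,t} \right) \widehat{p}_n(y^n).
$$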

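Proof. First, we expand the product in the definition of $S_n(Q, x^n)$:

$$
\begin{aligned}
S_n(Q, x^n) &= \prod_{t=1}^n \left( \sum_{j=1}^m x_{j,t} \, Q_{j,t}(x^{t-1}) \right) \\
  &= \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^n x_{y_t,t} \, Q_{y_t,t}(x^{t-1}) \right) \\
  &= \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^n x_{y_t,t} \right) \left( \prod_{t=1}^n Q_{y_t,t}(x^{t-1}) \right).
\end{aligned}
$$

On the other hand, if an investment strategy is induced by a forecaster $\widehat{p}$, then

$$
\begin{aligned}
S_n(P, x^n) &= \prod_{t=1}^n \left( \sum_{j=1}^m x_{j,t} \, P_{j,t}(x^{t-1}) \right) \\
  &= \prod_{t=1}^n \frac{\sum_{j=1}^m x_{j,t} \sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \widehat{p}_t(j \mid y^{t-1}) \, \widehat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}
                         {\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \widehat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)} \\
  &= \prod_{t=1}^n \frac{\sum_{y^{t} \in \mathcal{Y}^{t}} \left( \prod_{s=1}^{t} x_{y_s,s} \right) \widehat{p}_t(y^{t})}
                         {\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \left( \prod_{s=1}^{t-1} x_{y_s,s} \right) \widehat{p}_{t-1}(y^{t-1})} \\
  &= \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^n x_{y_t,t} \right) \widehat{p}_n(y^n),
\end{aligned}
$$

where in the last equality the product telescopes and we set
$\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \left( \prod_{s=1}^{t-1} x_{y_s,s} \right) \widehat{p}_{t-1}(y^{t-1}) = 1$ for $t = 1$. $\blacksquare$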
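
The second identity of Lemma 10.3 can be verified by brute force on a tiny market. The sketch
below is an illustration under stated assumptions: the forecaster is taken to be the Laplace
(add-one) forecaster, symbols are numbered from 0 rather than 1, and all function names are
hypothetical. It builds the induced investment strategy from its definition and compares the
wealth obtained period by period with the sum over all $m^n$ outcome sequences.

```python
import itertools
import random

M = 2  # number of assets; kept tiny because the sums range over M**t sequences


def laplace_cond(j, past):
    """Laplace (add-one) conditional probability of symbol j after the tuple `past`."""
    return (past.count(j) + 1) / (len(past) + M)


def joint(seq):
    """Joint probability the forecaster assigns to the whole sequence seq."""
    p = 1.0
    for t in range(len(seq)):
        p *= laplace_cond(seq[t], seq[:t])
    return p


def path_return(seq, market):
    """Product of the price relatives x_{y_s,s} along the path seq."""
    r = 1.0
    for s, y in enumerate(seq):
        r *= market[s][y]
    return r


def induced_portfolio(market_past):
    """Portfolio P_t(x^{t-1}) induced by the forecaster, straight from the definition."""
    num = [0.0] * M
    den = 0.0
    for seq in itertools.product(range(M), repeat=len(market_past)):
        w = joint(seq) * path_return(seq, market_past)
        den += w
        for j in range(M):
            num[j] += laplace_cond(j, seq) * w
    return [v / den for v in num]


rng = random.Random(1)
n = 6
market = [[rng.uniform(0.5, 1.5) for _ in range(M)] for _ in range(n)]

# Wealth accumulated period by period by the induced strategy.
direct = 1.0
for t in range(n):
    p = induced_portfolio(market[:t])
    direct *= sum(pj * xj for pj, xj in zip(p, market[t]))

# Wealth according to Lemma 10.3: a forecaster-weighted sum of "extremal" returns.
via_lemma = sum(path_return(seq, market) * joint(seq)
                for seq in itertools.product(range(M), repeat=n))
print(direct, via_lemma)
```

The enumeration is exponential in $n$ and is meant only as a check of the identity, not as a
practical way of computing the induced strategy.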


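Proof of Theorem 10.1. Fix any market sequence $x^n$ and choose any reference strategy
$Q' \in \mathcal{Q}$. To simplify notation, we write $S_n(y^n, x^n) = \prod_{t=1}^n x_{y_t,t}$. Then using the expressions
derived above for $S_n(Q', x^n)$ and $S_n(P, x^n)$, we have

$$
\begin{aligned}
\frac{S_n(Q', x^n)}{S_n(P, x^n)}
  &= \frac{\sum_{y^n \in \mathcal{Y}^n} S_n(y^n, x^n) \prod_{t=1}^n Q'_{y_t,t}(x^{t-1})}
          {\sum_{y^n \in \mathcal{Y}^n} S_n(y^n, x^n) \, \widehat{p}_n(y^n)} \\
  &\le \max_{y^n :\, S_n(y^n, x^n) > 0} \frac{S_n(y^n, x^n) \prod_{t=1}^n Q'_{y_t,t}(x^{t-1})}{S_n(y^n, x^n) \, \widehat{p}_n(y^n)}
     \qquad \text{(by Lemma 10.2)} \\
  &\le \max_{y^n \in \mathcal{Y}^n} \frac{\prod_{t=1}^n Q'_{y_t,t}(x^{t-1})}{\widehat{p}_n(y^n)} \\
  &\le \max_{y^n \in \mathcal{Y}^n} \sup_{Q \in \mathcal{Q}} \frac{\prod_{t=1}^n Q_{y_t,t}(x^{t-1})}{\widehat{p}_n(y^n)}. \qquad \blacksquare
\end{aligned}
$$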


An important and immediate corollary of Theorem 10.1 is that the minimax logarithmic
wealth ratio $W_n(\mathcal{Q})$ of any class $\mathcal{Q}$ of static strategies equals the minimax regret associated
with the class of the induced forecasters. To make this statement precise, we introduce
the notion of static investment strategies, similar to the notion of static experts in
prediction problems. A static investment strategy $Q$ satisfies $Q_t(x^{t-1}) = Q_t \in \mathcal{D}$ for each
$t = 1, \ldots, n$. Thus, the allocation of wealth $Q_t$ for each trading period does not depend on
the past market behavior.


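Theorem 10.2. Let $\mathcal{Q}$ be a class of static investment strategies, and let $\mathcal{F}$ denote the class
of forecasters induced by strategies in $\mathcal{Q}$. Then

$$
W_n(\mathcal{Q}) = V_n(\mathcal{F}).
$$

Furthermore, the minimax optimal investment strategy is defined by

$$
P^*_{j,t}(x^{t-1}) =
\frac{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} p^*_t(j \mid y^{t-1}) \, p^*_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}
     {\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} p^*_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)},
$$

where $p^*$ is the normalized maximum likelihood forecaster

$$
p^*_n(y^n) = \frac{\sup_{Q \in \mathcal{Q}} \prod_{t=1}^n Q_{y_t,t}}{\sum_{y^n \in \mathcal{Y}^n} \sup_{Q \in \mathcal{Q}} \prod_{t=1}^n Q_{y_t,t}}.
$$


Proof. By Lemma 10.1 we have $W_n(\mathcal{Q}) \ge V_n(\mathcal{F})$; so it suffices to prove that
$W_n(\mathcal{Q}) \le V_n(\mathcal{F})$. Recall from Theorem 9.1 that the normalized maximum likelihood forecaster
$p^*$ is minimax optimal for the class $\mathcal{F}$; that is,

$$
\max_{y^n \in \mathcal{Y}^n} \ln \frac{\sup_{Q \in \mathcal{Q}} \prod_{t=1}^n Q_{y_t,t}}{p^*_n(y^n)} = V_n(\mathcal{F}).
$$

Now let $P^*$ be the investment strategy induced by the minimax forecaster $p^*$ for $\mathcal{Q}$. By
Theorem 10.1 we get

$$
W_n(\mathcal{Q}) \le \sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n)}{S_n(P^*, x^n)}
  \le \max_{y^n \in \mathcal{Y}^n} \sup_{Q \in \mathcal{Q}} \ln \frac{\prod_{t=1}^n Q_{y_t,t}}{p^*_n(y^n)} = V_n(\mathcal{F}).
$$

The fact that the worst-case logarithmic wealth ratio
$\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \bigl( S_n(Q, x^n) / S_n(P^*, x^n) \bigr)$ achieved by
the strategy $P^*$ equals $W_n(\mathcal{Q})$ follows by the inequality above and the fact that
$V_n(\mathcal{F}) \le W_n(\mathcal{Q})$. $\blacksquare$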


Example 10.4 (Constantly rebalanced portfolios). Consider now the class $\mathcal{Q}$ of all
constantly rebalanced portfolios. It is obvious that the strategies of this class induce the
“constant” forecasters studied in Sections 9.5, 9.6, and 9.7. Thus, combining Theorem 10.2 with
the remark following Theorem 9.2, we obtain the following expression for the behavior of
the minimax logarithmic wealth ratio:

$$
W_n(\mathcal{Q}) = \frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1).
$$

This result shows that the wealth $S_n(P^*, x^n)$ of the minimax optimal investment strategy
given by Theorem 10.2 comes within a factor of order $n^{(m-1)/2}$ of the best possible constantly
rebalanced portfolio, regardless of the market behavior. Since, typically, $\sup_{Q \in \mathcal{Q}} S_n(Q, x^n)$
increases exponentially with $n$, this factor becomes negligible in the long run. $\square$
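
For $m = 2$ the minimax value can be computed exactly and compared with its asymptotic form.
The sketch below is illustrative: it relies on the fact (Theorems 9.1 and 10.2) that the
minimax value is the logarithm of the normalizing sum of the normalized maximum likelihood
forecaster, which for the constant experts over a binary alphabet is
$\sum_{k=0}^n \binom{n}{k} (k/n)^k ((n-k)/n)^{n-k}$. The function names and the choice
$n = 1000$ are assumptions made for the example.

```python
import math


def log_nml_normalizer_binary(n):
    """ln sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k), computed in log space."""
    logs = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_ml = 0.0  # log of the maximized likelihood, with the convention 0 ln 0 = 0
        if k > 0:
            log_ml += k * math.log(k / n)
        if k < n:
            log_ml += (n - k) * math.log((n - k) / n)
        logs.append(log_binom + log_ml)
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def asymptotic_value(n, m=2):
    """(m-1)/2 ln(n/(2 pi)) + ln(Gamma(1/2)^m / Gamma(m/2)), without the o(1) term."""
    return ((m - 1) / 2) * math.log(n / (2 * math.pi)) \
        + m * math.lgamma(0.5) - math.lgamma(m / 2)


n = 1000
exact = log_nml_normalizer_binary(n)
approx = asymptotic_value(n)
print(exact, approx, exact - approx)
```

For $m = 2$ the asymptotic expression simplifies to $\tfrac{1}{2} \ln(\pi n / 2)$, and the gap
between the two printed values is the $o(1)$ term, which shrinks as $n$ grows.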


10.4 Universal Portfolios 


Just as in the case of the prediction problem of Chapter 9, the minimax optimal solution
for the investment problem is not feasible in practice. In this section we introduce
computationally more attractive methods, close in spirit to the mixture forecasters of
Sections 9.6 and 9.7.

For simplicity and for its importance, in this section we restrict our attention to the class
$\mathcal{Q}$ of all constantly rebalanced portfolios. Recall that each strategy $Q$ in this class is
determined by a vector $B = (B_1, \ldots, B_m)$ in the probability simplex $\mathcal{D}$ in $\mathbb{R}^m$. In order to
compete with the best strategy in $\mathcal{Q}$, we introduce the universal portfolio strategy $P$ by

$$
P_{j,t}(x^{t-1}) = \frac{\int_{\mathcal{D}} B_j \, S_{t-1}(B, x^{t-1}) \, \mu(B) \, dB}{\int_{\mathcal{D}} S_{t-1}(B, x^{t-1}) \, \mu(B) \, dB},
\qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
$$

where $\mu$ is a density function on $\mathcal{D}$. In the simplest case $\mu$ is just the uniform density,
though we will see that it may be advantageous to consider nonuniform densities such
as the Dirichlet$(1/2, \ldots, 1/2)$ density. In any case, the universal portfolio is a weighted
average of the strategies in $\mathcal{Q}$, weighted by their past performance. We will see in the
proof of Theorem 10.3 below that the universal portfolio is nothing but the investment
strategy induced by the mixture forecaster (the Laplace mixture in case of uniform $\mu$ and
the Krichevsky–Trofimov mixture if $\mu$ is the Dirichlet$(1/2, \ldots, 1/2)$ density).

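The wealth achieved by the universal portfolio is just the average of the wealths achieved
by the individual strategies in the class. This may be easily seen by observing that

$$
\begin{aligned}
S_n(P, x^n) &= \prod_{t=1}^n \sum_{j=1}^m P_{j,t}(x^{t-1}) \, x_{j,t} \\
  &= \prod_{t=1}^n \frac{\int_{\mathcal{D}} \sum_{j=1}^m x_{j,t} B_j \, S_{t-1}(B, x^{t-1}) \, \mu(B) \, dB}{\int_{\mathcal{D}} S_{t-1}(B, x^{t-1}) \, \mu(B) \, dB} \\
  &= \prod_{t=1}^n \frac{\int_{\mathcal{D}} S_t(B, x^t) \, \mu(B) \, dB}{\int_{\mathcal{D}} S_{t-1}(B, x^{t-1}) \, \mu(B) \, dB} \\
  &= \int_{\mathcal{D}} S_n(B, x^n) \, \mu(B) \, dB
\end{aligned}
$$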


because the product is telescoping and $S_0 = 1$. This last expression offers an intuitive
explanation of what the universal portfolio does: by approximating the integral by a Riemann
sum, we have

$$
S_n(P, x^n) \approx \sum_i Q_i \, S_n(B_i, x^n),
$$

where, given the elements $\Delta_i$ of a fine finite partition of the simplex $\mathcal{D}$, we assume that
$B_i \in \Delta_i$ and $Q_i = \int_{\Delta_i} \mu(B) \, dB$. The right-hand side is the capital accumulated by a strategy
that distributes its initial capital among the constantly rebalanced investment strategies $B_i$
according to the proportions $Q_i$ and lets these strategies work with their initial share. In
other words, the universal portfolio performs a kind of buy-and-hold over all constantly
rebalanced portfolios.
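
The Riemann-sum view suggests a simple way of experimenting with the universal portfolio
when $m = 2$. The following sketch is an approximation under explicit assumptions: the simplex
is replaced by a grid of 200 midpoints $b$ (portfolio $(b, 1-b)$), the density is uniform, and
the market is random; none of these choices comes from the text. It checks that the wealth of
the discretized strategy equals the average wealth of the grid portfolios.

```python
import math
import random

# Midpoints of 200 equal cells of [0, 1]; each b stands for the portfolio (b, 1 - b).
GRID = [(i + 0.5) / 200 for i in range(200)]


def universal_portfolio_run(market):
    """Run the discretized uniform-density universal portfolio on a 2-asset market.

    Returns the achieved wealth and the final wealth of every grid portfolio."""
    crp_wealth = [1.0] * len(GRID)  # S_{t-1}(B, x^{t-1}) for each grid point
    wealth = 1.0
    for x1, x2 in market:
        total = sum(crp_wealth)
        # Performance-weighted average of the grid portfolios (first component).
        p1 = sum(b * s for b, s in zip(GRID, crp_wealth)) / total
        wealth *= p1 * x1 + (1 - p1) * x2
        crp_wealth = [s * (b * x1 + (1 - b) * x2) for b, s in zip(GRID, crp_wealth)]
    return wealth, crp_wealth


rng = random.Random(2)
n = 100
market = [(rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)) for _ in range(n)]

wealth, crp_wealth = universal_portfolio_run(market)
average = sum(crp_wealth) / len(crp_wealth)
log_ratio = math.log(max(crp_wealth) / wealth)
print(wealth, average, log_ratio, math.log(n + 1))
```

The equality of `wealth` and `average` is the telescoping identity in discrete form. The
comparison of `log_ratio` with $\ln(n+1)$ is only indicative here, because the guarantee for
the uniform density concerns the integral rather than a finite grid.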

This simple observation is the key in establishing performance bounds for the universal 
portfolio. 


Theorem 10.3. If $\mu$ is the uniform density on the probability simplex $\mathcal{D}$ in $\mathbb{R}^m$, then the
wealth achieved by the universal portfolio satisfies

$$
\sup_{x^n} \sup_{B \in \mathcal{D}} \ln \frac{S_n(B, x^n)}{S_n(P, x^n)} \le (m-1) \ln(n+1).
$$

If the universal portfolio is defined using the Dirichlet$(1/2, \ldots, 1/2)$ density $\mu$, then

$$
\sup_{x^n} \sup_{B \in \mathcal{D}} \ln \frac{S_n(B, x^n)}{S_n(P, x^n)}
  \le \frac{m-1}{2} \ln n + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{m-1}{2} \ln 2 + o(1).
$$

The second statement shows that the logarithmic worst-case wealth ratio of the universal
portfolio based on the Dirichlet$(1/2, \ldots, 1/2)$ density comes within a constant of the
minimax optimal investment strategy.


Proof. First recall that each constantly rebalanced portfolio strategy indexed by $B$ is
induced by the “constant” forecaster $p^B$, which assigns probability

$$
p^B_n(y^n) = B_1^{n_1} \cdots B_m^{n_m}
$$

to each sequence $y^n$ in which the number of occurrences of symbol $j$ is $n_j$ ($j = 1, \ldots, m$).
By Lemma 10.3, the wealth achieved by such a strategy is

$$
S_n(B, x^n) = \sum_{y^n} \left( \prod_{t=1}^n x_{y_t,t} \right) p^B_n(y^n).
$$

Using the fact that the wealth achieved by the universal portfolio is just the average of the
wealths achieved by the strategies in $\mathcal{Q}$, we have

$$
\begin{aligned}
S_n(P, x^n) &= \int_{\mathcal{D}} S_n(B, x^n) \, \mu(B) \, dB \\
  &= \sum_{y^n} \left( \prod_{t=1}^n x_{y_t,t} \right) \int_{\mathcal{D}} p^B_n(y^n) \, \mu(B) \, dB.
\end{aligned}
$$

The last expression shows that the universal portfolio $P$ is induced by the mixture forecaster

$$
\widehat{p}_n(y^n) = \int_{\mathcal{D}} p^B_n(y^n) \, \mu(B) \, dB.
$$

(Simply note that by Lemma 10.3 the wealth achieved by the strategy induced by the
mixture forecaster is the same as the wealth achieved by the universal portfolio; hence the
two strategies must coincide.)

By Theorem 10.1,

$$
\sup_{x^n} \sup_{B \in \mathcal{D}} \ln \frac{S_n(B, x^n)}{S_n(P, x^n)}
  \le \max_{y^n \in \mathcal{Y}^n} \sup_{B \in \mathcal{D}} \ln \frac{p^B_n(y^n)}{\widehat{p}_n(y^n)}.
$$

In other words, the worst-case logarithmic wealth ratio achieved by the universal portfolio
is bounded by the worst-case logarithmic regret of the mixture forecaster. But we have
already studied this latter quantity. In particular, if $\mu$ is the uniform density, then $\widehat{p}_n$ is just
the Laplace forecaster whose performance is bounded by Theorem 9.3 (for $m = 2$) and by
Exercise 9.10 (for $m > 2$), yielding the first half of the theorem.

The second statement is obtained by noting that if the universal portfolio is defined
on the basis of the Dirichlet$(1/2, \ldots, 1/2)$ density, then $\widehat{p}_n$ is the Krichevsky–Trofimov
forecaster whose loss is bounded in Section 9.7. $\blacksquare$


10.5 The EG Investment Strategy 


The worst-case performance of the universal portfolio is basically unimprovable, but it has 
some practical disadvantages. Just note that the definition of the universal portfolio involves 
integration over an m-dimensional simplex. Even for moderate values of m, the exponential 
computational cost may become prohibitive. In this section we describe a simple strategy 
P whose computational cost is linear in m, a dramatic improvement. Unfortunately, the 
performance guarantees of this version are inferior to those established in Theorem 10.3 
for the universal portfolio. 




This strategy, called the EG investment strategy, invests at time $t$ using the vector
$P_t = (P_{1,t}, \ldots, P_{m,t})$ where $P_1 = (1/m, \ldots, 1/m)$ and

$$
P_{i,t} = \frac{P_{i,t-1} \exp\bigl( \eta \, x_{i,t-1} / (P_{t-1} \cdot \mathbf{x}_{t-1}) \bigr)}
               {\sum_{j=1}^m P_{j,t-1} \exp\bigl( \eta \, x_{j,t-1} / (P_{t-1} \cdot \mathbf{x}_{t-1}) \bigr)},
\qquad i = 1, \ldots, m, \quad t = 2, 3, \ldots
$$

This weight assignment is a special case of the gradient-based forecaster for linear regression
introduced in Section 11.4:

$$
P_{i,t} = \frac{P_{i,t-1} \exp\bigl( -\eta \, \nabla \ell_{t-1}(P_{t-1})_i \bigr)}
               {\sum_{j=1}^m P_{j,t-1} \exp\bigl( -\eta \, \nabla \ell_{t-1}(P_{t-1})_j \bigr)}
$$

when the loss function is set as $\ell_{t-1}(P_{t-1}) = -\ln (P_{t-1} \cdot \mathbf{x}_{t-1})$, whose gradient has
components $-x_{i,t-1} / (P_{t-1} \cdot \mathbf{x}_{t-1})$.
Note that with this loss function, $\widehat{L}_n = -\ln S_n(P, x^n)$, where $\widehat{L}_n$ is the cumulative loss of
the gradient-based forecaster, and $L_n(B) = -\ln S_n(B, x^n)$, where $L_n(B)$ is the cumulative
loss of the forecaster with fixed coefficients $B$. Hence, by adapting the proof of
Theorem 11.3, which shows a bound on the regret of the gradient-based forecaster, we can
bound the worst-case logarithmic wealth ratio of the EG investment strategy. On the other
hand, the following simple and direct analysis provides slightly better constants.


Theorem 10.4. Assume that the price relatives $x_{i,t}$ all fall between two positive constants
$c < C$. Then the worst-case logarithmic wealth ratio of the EG investment strategy with
$\eta = (c/C) \sqrt{(8 \ln m)/n}$ is bounded by

$$
\frac{\ln m}{\eta} + \frac{n \eta}{8} \frac{C^2}{c^2} = \frac{C}{c} \sqrt{\frac{n}{2} \ln m}.
$$

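Proof. The worst-case logarithmic wealth ratio is

$$
\max_{x^n} \max_{B \in \mathcal{D}} \ln \frac{\prod_{t=1}^n B \cdot \mathbf{x}_t}{\prod_{t=1}^n P_t \cdot \mathbf{x}_t},
$$

where the first maximum is taken over market sequences satisfying the boundedness
assumption. By using the elementary inequality $\ln(1 + u) \le u$, we obtain

$$
\begin{aligned}
\ln \frac{\prod_{t=1}^n B \cdot \mathbf{x}_t}{\prod_{t=1}^n P_t \cdot \mathbf{x}_t}
  &= \sum_{t=1}^n \ln \left( 1 + \frac{(B - P_t) \cdot \mathbf{x}_t}{P_t \cdot \mathbf{x}_t} \right) \\
  &\le \sum_{t=1}^n \sum_{i=1}^m \frac{(B_i - P_{i,t}) \, x_{i,t}}{P_t \cdot \mathbf{x}_t} \\
  &= \sum_{t=1}^n \left( \sum_{j=1}^m \frac{B_j \, x_{j,t}}{P_t \cdot \mathbf{x}_t} - \sum_{i=1}^m \frac{P_{i,t} \, x_{i,t}}{P_t \cdot \mathbf{x}_t} \right).
\end{aligned}
$$

Introducing the notation

$$
\ell'_{i,t} = \frac{C}{c} - \frac{x_{i,t}}{P_t \cdot \mathbf{x}_t}
$$

and noting that, under the boundedness assumption $0 < c \le x_{i,t} \le C$, $\ell'_{i,t} \in [0, C/c]$, we
may rewrite the upper bound above as

$$
\sum_{t=1}^n \sum_{i=1}^m P_{i,t} \, \ell'_{i,t} - \sum_{t=1}^n \sum_{j=1}^m B_j \, \ell'_{j,t}.
$$

Because this expression is a linear function of $B$, it achieves its maximum in one of the
corners of the simplex $\mathcal{D}$, and therefore

$$
\max_{B \in \mathcal{D}} \ln \frac{\prod_{t=1}^n B \cdot \mathbf{x}_t}{\prod_{t=1}^n P_t \cdot \mathbf{x}_t}
  \le \sum_{t=1}^n \sum_{i=1}^m P_{i,t} \, \ell'_{i,t} - \min_{j=1,\ldots,m} \sum_{t=1}^n \ell'_{j,t}.
$$

Now note that

$$
P_{i,t} = \frac{\exp\bigl( \eta \sum_{s=1}^{t-1} x_{i,s} / (P_s \cdot \mathbf{x}_s) \bigr)}{\sum_{j=1}^m \exp\bigl( \eta \sum_{s=1}^{t-1} x_{j,s} / (P_s \cdot \mathbf{x}_s) \bigr)}
        = \frac{\exp\bigl( -\eta \sum_{s=1}^{t-1} \ell'_{i,s} \bigr)}{\sum_{j=1}^m \exp\bigl( -\eta \sum_{s=1}^{t-1} \ell'_{j,s} \bigr)},
$$

where the second equality holds because the constant $C/c$ cancels between numerator and
denominator. Hence, $P_1, P_2, \ldots$ are the predictions of the exponentially weighted average forecaster
applied to a linear loss function with range $[0, C/c]$. The regret

$$
\sum_{t=1}^n \sum_{i=1}^m P_{i,t} \, \ell'_{i,t} - \min_{j=1,\ldots,m} \sum_{t=1}^n \ell'_{j,t}
$$

can thus be bounded using Theorem 2.2 applied to scaled losses (see Section 2.6). (Note
that this theorem also applies when, as in this case, the losses at time $t$ depend on the
forecaster’s prediction $P_t \cdot \mathbf{x}_t$.) $\blacksquare$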
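
The EG update is short enough to state in code. The sketch below is a minimal illustration
under assumed parameters (the values of $c$, $C$, $n$, the random market, and the names
`eg_run` and `crp_wealth` are not from the text). It runs the strategy with the tuning of
Theorem 10.4 and compares its wealth with the best constantly rebalanced portfolio on a grid.

```python
import math
import random


def eg_run(market, eta):
    """Run the EG investment strategy; return the portfolios used and the final wealth."""
    m = len(market[0])
    p = [1.0 / m] * m  # P_1 is the uniform portfolio
    portfolios, wealth = [], 1.0
    for x in market:
        portfolios.append(p)
        ret = sum(pi * xi for pi, xi in zip(p, x))  # P_t . x_t
        wealth *= ret
        # Multiplicative update with exponent eta * x_{i,t} / (P_t . x_t), then normalize.
        w = [pi * math.exp(eta * xi / ret) for pi, xi in zip(p, x)]
        z = sum(w)
        p = [wi / z for wi in w]
    return portfolios, wealth


def crp_wealth(b, market):
    """Wealth factor of the constantly rebalanced portfolio b."""
    s = 1.0
    for x in market:
        s *= sum(bi * xi for bi, xi in zip(b, x))
    return s


c, C, m, n = 0.8, 1.25, 2, 400          # illustrative bounds on the price relatives
eta = (c / C) * math.sqrt(8 * math.log(m) / n)

rng = random.Random(3)
market = [[rng.uniform(c, C) for _ in range(m)] for _ in range(n)]

_, eg_wealth = eg_run(market, eta)
best_crp = max(crp_wealth((i / 100, 1 - i / 100), market) for i in range(101))
log_ratio = math.log(best_crp / eg_wealth)
bound = (C / c) * math.sqrt(n * math.log(m) / 2)
print(log_ratio, bound)
```

Each update costs time linear in $m$, in contrast with the integration required by the
universal portfolio. On benign random markets the observed ratio is typically far below the
worst-case bound.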


The knowledge of constants c and C can be avoided using exponential forecasters with 
time-varying potentials like the one described in Section 2.8. 


Remark 10.1. A linear upper bound on the worst-case logarithmic wealth ratio is inevitably
suboptimal. Indeed, the linear upper bound

$$
\sum_{j=1}^m B_j \sum_{t=1}^n \left( \sum_{i=1}^m P_{i,t} \, \ell'_{i,t} - \ell'_{j,t} \right)
  = \sum_{j=1}^m B_j \sum_{i=1}^m \left( \sum_{t=1}^n P_{i,t} \bigl( \ell'_{i,t} - \ell'_{j,t} \bigr) \right)
$$

is maximized for a constantly rebalanced portfolio $B$ lying in a corner of the simplex $\mathcal{D}$,
whereas the logarithmic wealth ratio $\ln \prod_{t=1}^n (B \cdot \mathbf{x}_t / P_t \cdot \mathbf{x}_t)$ is concave in $B$, and therefore it
is possibly maximized in the interior of the simplex. Thus, no algorithm trying to minimize
the linear upper bound on the worst-case logarithmic wealth ratio can be minimax optimal.
Note also that the bound obtained for the worst-case logarithmic wealth ratio of the EG
strategy grows as $\sqrt{n}$, whereas that of the universal portfolio has only a logarithmic growth.


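The following simple example shows that the bound of the order of $\sqrt{n}$ cannot be improved
for the EG strategy. Consider a market with two assets and market vectors $\mathbf{x}_t = (1, 1/2)$ for
all $t$. Then, for every wealth allocation $P_t$, $1/2 \le P_t \cdot \mathbf{x}_t \le 1$. The best constantly rebalanced
portfolio is clearly $(1, 0)$, and the corresponding logarithmic wealth ratio is

$$
\sum_{t=1}^n \ln \frac{1}{1 - P_{2,t}/2} \ge \sum_{t=1}^n \frac{P_{2,t}}{2}.
$$

In the case of the EG strategy, we may lower bound $P_{2,t}$ by

$$
\begin{aligned}
P_{2,t} &= \frac{\exp\bigl( \eta \sum_{s=1}^{t-1} 1/(2 P_s \cdot \mathbf{x}_s) \bigr)}
                {\exp\bigl( \eta \sum_{s=1}^{t-1} 1/(P_s \cdot \mathbf{x}_s) \bigr) + \exp\bigl( \eta \sum_{s=1}^{t-1} 1/(2 P_s \cdot \mathbf{x}_s) \bigr)} \\
        &= \frac{\exp\bigl( -\eta \sum_{s=1}^{t-1} 1/(2 P_s \cdot \mathbf{x}_s) \bigr)}
                {1 + \exp\bigl( -\eta \sum_{s=1}^{t-1} 1/(2 P_s \cdot \mathbf{x}_s) \bigr)} \\
        &\ge \frac{\exp(-\eta (t-1))}{2}.
\end{aligned}
$$

Thus, the logarithmic wealth ratio of the EG algorithm is bounded from below by

$$
\sum_{t=1}^n \frac{\exp(-\eta (t-1))}{4} = \frac{1}{4} \cdot \frac{1 - e^{-\eta n}}{1 - e^{-\eta}} \approx \frac{1}{4\eta},
$$

where the last approximation holds for large values of $n$. Since $\eta$ is proportional to $1/\sqrt{n}$,
the worst-case logarithmic wealth ratio is at least proportional to $\sqrt{n}$, a value significantly
larger than the logarithmic growth obtained for the universal portfolio.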
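
The lower bound is easy to observe numerically. The sketch below is an illustration with
assumed horizons; the function names are hypothetical. It runs EG on the constant market
$(1, 1/2)$ with $\eta = (c/C)\sqrt{(8 \ln 2)/n}$, where $c/C = 1/2$, and compares the
logarithmic wealth ratio against the geometric-sum lower bound derived above.

```python
import math


def eg_log_wealth_ratio(n, eta):
    """ln of (wealth of the portfolio (1, 0)) / (EG wealth) on the market x_t = (1, 1/2)."""
    p = [0.5, 0.5]
    log_ratio = 0.0
    for _ in range(n):
        ret = p[0] * 1.0 + p[1] * 0.5      # P_t . x_t, always between 1/2 and 1
        log_ratio -= math.log(ret)         # the portfolio (1, 0) earns exactly 1 per period
        w = [p[0] * math.exp(eta * 1.0 / ret), p[1] * math.exp(eta * 0.5 / ret)]
        z = w[0] + w[1]
        p = [w[0] / z, w[1] / z]
    return log_ratio


def lower_bound(n, eta):
    """(1/4) (1 - exp(-eta n)) / (1 - exp(-eta)), the geometric sum from the text."""
    return 0.25 * (1 - math.exp(-eta * n)) / (1 - math.exp(-eta))


results = {}
for n in (100, 400, 1600):
    eta = 0.5 * math.sqrt(8 * math.log(2) / n)   # c/C = 1/2 and m = 2
    results[n] = (eg_log_wealth_ratio(n, eta), lower_bound(n, eta))
    print(n, results[n])
```

Quadrupling $n$ halves $\eta$ and therefore roughly doubles the lower bound, in agreement with
the $\sqrt{n}$ growth.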


10.6 Investment with Side Information 


The investment strategies considered up to this point determine their portfolio as a function 
of the past market behavior and the investment strategies in the comparison class. However, 
sometimes an investor may want to incorporate external information in constructing a 
portfolio. For example, the price of oil may have an effect on stock prices, and one may
not want to ignore it even if oil is not traded on the market. Such arguments lead us to
incorporate the notion of side information. We do this as in Section 9.9.

Suppose that, at trading period $t$, before determining a portfolio, the investor observes
the side information $z_t$, which we assume to take values in a set $\mathcal{Z}$ of finite cardinality.
For simplicity, and without loss of generality, we take $\mathcal{Z} = \{1, \ldots, K\}$. The portfolio
chosen by the investor at time $t$ may now depend on the side information $z_t$. Formally, an
investment strategy with side information $Q$ is a sequence of functions
$Q_t : (\mathbb{R}_+^m)^{t-1} \times \mathcal{Z} \to \mathcal{D}$, $t = 1, \ldots, n$. At time $t$, on observing the side information $z_t$,
the strategy uses the portfolio $Q_t(x^{t-1}, z_t)$. Starting with a unit capital, the accumulated
wealth after $n$ trading periods becomes

$$
S_n(Q, x^n, z^n) = \prod_{t=1}^n \mathbf{x}_t \cdot Q_t(x^{t-1}, z_t).
$$
t=1 
Our goal is to design investment strategies that compete with the best in a given reference
class of investment strategies with side information. For simplicity, we consider reference
classes built from static classes of investment strategies. More precisely, let $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$
be “base” classes of static investment strategies and let $Q^{(1)} \in \mathcal{Q}_1, \ldots, Q^{(K)} \in \mathcal{Q}_K$ be
arbitrary strategies. The class of investment strategies with side information we consider
consists of strategies of the form

$$
Q_t(x^{t-1}, z_t) = Q^{(z_t)}_{n_{z_t}},
$$

where $Q^{(j)}_t \in \mathcal{D}$ is the portfolio of an investment strategy $Q^{(j)} \in \mathcal{Q}_j$ at the $t$th time period and
$n_j$ is the length of the sequence of those time instances $s < t$ when $z_s = j$ ($j = 1, \ldots, K$).
In other words, a strategy in the comparison class assigns a static strategy $Q^{(j)}$ to any
$j = 1, \ldots, K$ and uses this strategy whenever the side information equals $j$. This formulation
is the investment analogue of the problem of prediction with side information described in
Section 9.9.

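In analogy with the forecasting strategies introduced in Section 9.9, we may consider
the following investment strategy: let $\mathcal{G}_1, \ldots, \mathcal{G}_K$ denote the classes of (static) forecasters
induced by the classes of investment strategies $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$, respectively. Let $q^{(1)}, \ldots, q^{(K)}$
be forecasters with worst-case cumulative regrets (with respect to the corresponding
reference classes)

$$
V_n(q^{(j)}, \mathcal{G}_j) = \sup_{y^n \in \mathcal{Y}^n} \sup_{g^{(j)} \in \mathcal{G}_j} \sum_{t=1}^n \ln \frac{g^{(j)}_t(y_t)}{q^{(j)}_t(y_t \mid y^{t-1})}.
$$

On the basis of this, one may define the forecaster with side information as in Section 9.9:

$$
\widehat{p}_t(y \mid y^{t-1}, z_t) = q^{(z_t)}\bigl( y \mid y^{t}_{z_t} \bigr),
$$

where $y^{t}_{j}$ denotes the subsequence of $y^{t-1}$ determined by those time instances $s < t$ when
$z_s = j$. This forecaster now induces the investment strategy with side information
$P_t(x^{t-1}, z_t)$ defined by its components

$$
P_{j,t}(x^{t-1}, z_t) =
\frac{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \widehat{p}_t(j \mid y^{t-1}, z_t) \prod_{s=1}^{t-1} \bigl( x_{y_s,s} \, \widehat{p}_s(y_s \mid y^{s-1}, z_s) \bigr)}
     {\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \prod_{s=1}^{t-1} \bigl( x_{y_s,s} \, \widehat{p}_s(y_s \mid y^{s-1}, z_s) \bigr)}.
$$

The following result is a straightforward combination of Theorems 9.7 and 10.1. The proof
is left as an exercise.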


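Theorem 10.5. For any side-information sequence $z_1, \ldots, z_n$, the investment strategy
defined above has a worst-case logarithmic wealth ratio bounded by

$$
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n, z^n)}{S_n(P, x^n, z^n)} \le \sum_{j=1}^K V_{n_j}(q^{(j)}, \mathcal{G}_j),
$$

where $n_j$ denotes the number of time instances $t \le n$ with $z_t = j$.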

If the base classes $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$ are all equal to the class of all constantly rebalanced
portfolios and the forecasters $q^{(j)}$ are mixture forecasters, then it is easy to see that the
investment strategy of Theorem 10.5 takes the simple form

$$
P_{j,t}(x^{t-1}, z_t) = \frac{\int_{\mathcal{D}} B_j \, S_{n_{z_t}}(B, x^{t}_{z_t}) \, \mu(B) \, dB}{\int_{\mathcal{D}} S_{n_{z_t}}(B, x^{t}_{z_t}) \, \mu(B) \, dB},
\qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
$$

where $\mu$ is a density function on $\mathcal{D}$ and $x^{t}_{j}$ is the subsequence of the past market sequence
$x^{t-1}$ determined by those time instances $s < t$ when $z_s = j$. Thus, the strategy defined
above simply selects the subsequence of the past corresponding to the times when the
side information was the same as the actual value of side information $z_t$ and calculates a
universal portfolio over that subsequence. Theorem 10.5, combined with Theorem 10.3,
implies the following.


Corollary 10.1. Assume that the universal portfolio with side information defined above is
calculated on the basis of the Dirichlet$(1/2, \ldots, 1/2)$ density $\mu$. Let $\mathcal{Q}$ denote the class
of investment strategies with side information such that the base classes $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$ all
coincide with the class of constantly rebalanced portfolios. Then, for any side information
sequence, the worst-case logarithmic wealth ratio with respect to class $\mathcal{Q}$ satisfies

$$
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n, z^n)}{S_n(P, x^n, z^n)}
  \le \frac{K(m-1)}{2} \ln \frac{n}{K} + K \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{K(m-1)}{2} \ln 2 + o(1).
$$

10.7 Bibliographic Remarks 


The theory of portfolio selection was initiated by the influential work of Markowitz [209], 
who introduced a statistical theory of investment. Kelly [180] considered a quite different 
approach, closer to the spirit of this chapter, for horse race markets of the type described 
in Section 9.3, and assumed an independent, identically distributed sequence of market 
vectors. Breiman [41] extended Kelly’s framework to general markets with i.i.d. returns. 
The assumption of independence was substantially relaxed by Algoet and Cover [6], who 
considered stationary and ergodic markets; see also Algoet [4,5], Walk and Yakowitz [304], 
Gyorfi and Schafer [136], and Gyorfi, Lugosi, and Udina [135] for various results under 
such general assumptions. 

The problem of sequential investment in arbitrary markets was first considered by Cover
and Gluss [71], who used Blackwell’s approachability (see Section 7.7) to construct an 
investment strategy that performs almost as well as the best constantly rebalanced portfolio 
if the market vectors take their values from a given finite set. The universal portfolio 
strategy, discussed in Section 10.4, was introduced and analyzed in a pioneering work of 
Cover [70]. The minimax value W,,(Q) for the class of all constantly rebalanced portfolios 
was found by Ordentlich and Cover [229]. The bound for the universal portfolios over 
the class of constantly rebalanced portfolios was obtained by Cover and Ordentlich [72]. 
The general results of Theorems 10.1 and 9.2 were given in Cesa-Bianchi and Lugosi [52]. 
The EG investments strategy was introduced and analyzed by Hembold, Schapire, Singer, 
and Warmuth [158]. Theorem 10.4 is due to them (though the proof presented here is taken 
from Stoltz and Lugosi [279]). 

The model of investment with side information described in Section 10.6 was introduced
by Cover and Ordentlich [72], and Corollary 10.1 is theirs. Györfi, Lugosi, and Udina [135]
choose the side information by nonparametric methods and construct investment strategies
with universal guarantees for stationary and ergodic markets.

Singer [269] considers the problem of “tracking the best portfolio” and uses the tech- 
niques described in Section 5.2 to construct investment strategies that perform almost as 
well as the best investment strategy, in hindsight, which is allowed to switch between 
portfolios a limited number of times.

Kalai and Vempala [175] develop efficient algorithms for approximate calculation of the
universal portfolio.

Vovk and Watkins [303] describe various versions of the universal portfolio, considering, 
among others, the possibility of “short sales,” that is, when the portfolio vectors may have 
negative components (see also Cover and Ordentlich [73]). 

Cross and Barron [78] extend the universal portfolio to general smoothly parameterized 
classes of investment strategies and also extend the problem of sequential investment to 
continuous time. 

One important aspect of sequential trading that we have ignored throughout the chapter 
is transaction costs. Including transaction costs in the model is a complex problem that has 
been considered from various different points of view. We just mention the work of Blum 
and Kalai [33], Iyengar and Cover [167], Iyengar [166], and Merhav, Ordentlich, Seroussi, 
and Weinberger [215]. 

Borodin, El-Yaniv, and Gogan [37] propose ad hoc investment strategies with very 
convincing empirical performance. 

Stoltz and Lugosi [279] introduce and study the notion of internal regret (see Section 4.4) 
in the framework of sequential investment. 


10.8 Exercises 


10.1 Show that for any constantly rebalanced portfolio strategy $\mathbf{B}$, the achieved wealth $S_n(\mathbf{B}, \mathbf{x}^n)$ is
invariant under permutations of the sequence $\mathbf{x}_1, \ldots, \mathbf{x}_n$.


10.2 Show that the wealth $S_n(P, \mathbf{x}^n)$ achieved by the universal portfolio $P$ is invariant under
permutations of the sequence $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Show that the same is true for the minimax optimal
investment strategy $P^*$ (with respect to the class of constantly rebalanced portfolios).

10.3 (Universal portfolio exceeds value line index) Let $P$ be the universal portfolio strategy
based on the uniform density $\mu$. Show that the wealth achieved by $P$ is at least as large as the
geometric mean of the wealth achieved by the individual stocks, that is,

\[
S_n(P, \mathbf{x}^n) \ge \left( \prod_{j=1}^{m} \prod_{t=1}^{n} x_{j,t} \right)^{1/m}.
\]

(Cover [70].) Hint: Use Jensen's inequality twice.

10.4 Let F be a finite class of forecasters. Show that the investment strategy induced by the mixture 
forecaster over this class is just the strategy described in the first example of Section 10.2 
based on the class Q of investment strategies induced by members of F. 

10.5 Prove Lemma 10.2. 


10.6 Consider the following randomized approximation of the universal portfolio. Observe that
an interpretation of the identity $S_n(P, \mathbf{x}^n) = \int_{\mathcal{D}} S_n(\mathbf{B}, \mathbf{x}^n)\, \mu(\mathbf{B})\, d\mathbf{B}$ is that the wealth achieved
by the universal portfolio is the expected value, with respect to the density $\mu$, of the wealth
of all constantly rebalanced strategies. This expectation may be approximated by randomly
choosing $N$ vectors $\mathbf{B}_1, \ldots, \mathbf{B}_N$ according to the density $\mu$ and distributing the initial wealth
uniformly among them just as in the example of Section 10.2. Investigate the relationship of
the wealth achieved by this randomized strategy and that of the universal portfolio. What value
of $N$ do you suggest? (Blum and Kalai [33].)


10.7 Consider a market of $m > 2$ assets and class $\mathcal{Q}$ of all investment strategies that rebalance
between two assets. More precisely, $\mathcal{Q}$ is the class of all constantly rebalanced portfolios
such that the probability vector $\mathbf{B} \in \mathcal{D}$ characterizing the strategy has at most two nonzero
components. Determine tight bounds for the minimax logarithmic wealth ratio $W_n(\mathcal{Q})$. Define
and analyze a universal portfolio for this class.


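10.8 (Universal portfolio for smoothly parameterized classes) Let $\mathcal{Q}$ be a class of static investment
strategies, that is, each $Q \in \mathcal{Q}$ is given by a sequence $\mathbf{B}_1, \ldots, \mathbf{B}_n$ of portfolio vectors (i.e.,
$\mathbf{B}_t \in \mathcal{D}$). Assume that the strategies in $\mathcal{Q}$ are parameterized by a set of vectors $\Theta \subset \mathbb{R}^d$, that is,
$\mathcal{Q} = \{Q_\theta = (\mathbf{B}_{\theta,1}, \ldots, \mathbf{B}_{\theta,n}) : \theta \in \Theta\}$. Assume that $\Theta$ is a convex, compact set with nonempty
interior, and the parameterization is smooth in the sense that

\[
\|\mathbf{B}_{\theta,t} - \mathbf{B}_{\theta',t}\| \le c\, \|\theta - \theta'\|
\]

for all $\theta, \theta' \in \Theta$ and $t = 1, \ldots, n$, where $c > 0$ is a constant. Let $\mu$ be a bounded density on
$\Theta$ and define the generalized universal portfolio by

\[
P_{j,t}(\mathbf{x}^{t-1}) = \frac{\int_\Theta B^{(j)}_{\theta,t}\, S_{t-1}(Q_\theta, \mathbf{x}^{t-1})\, \mu(\theta)\, d\theta}{\int_\Theta S_{t-1}(Q_\theta, \mathbf{x}^{t-1})\, \mu(\theta)\, d\theta},
\qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
\]

where $B^{(j)}_{\theta,t}$ denotes the $j$th component of the portfolio vector $\mathbf{B}_{\theta,t}$. Show that if the price
relatives $x_{i,t}$ fall between $1/C$ and $C$ for some constant $C > 1$, then the worst-case logarithmic
wealth ratio satisfies

\[
\sup_{\mathbf{x}^n} \sup_{\theta \in \Theta} \ln \frac{S_n(Q_\theta, \mathbf{x}^n)}{S_n(P, \mathbf{x}^n)} = O(d \ln n)
\]

(Cross and Barron [78]).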


10.9 (Switching portfolios) Define class $\mathcal{Q}$ of investment strategies that can switch buy-and-hold
strategies at most $k$ times during $n$ trading periods. More precisely, any strategy $Q \in \mathcal{Q}$ is
characterized by a sequence $i_1, \ldots, i_n \in \{1, \ldots, m\}$ of indices of assets with $\mathrm{size}(i_1, \ldots, i_n) \le k$
(where size denotes the number of switches in the sequence; see the definition in Section 5.2)
such that, at time $t$, $Q$ invests all its capital in asset $i_t$. Construct an efficiently computable
investment strategy $P$ whose worst-case logarithmic wealth ratio satisfies

\[
\sup_{\mathbf{x}^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, \mathbf{x}^n)}{S_n(P, \mathbf{x}^n)} \le (k+1) \ln m + k \ln \frac{n}{k}
\]

(Singer [269]). Hint: Combine Theorem 10.1 with the techniques in Section 5.2.

10.10 (Switching constantly rebalanced portfolios) Consider now class $\mathcal{Q}$ of investment strategies
that can switch constantly rebalanced portfolios at most $k$ times. Thus, a strategy $Q \in \mathcal{Q}$
is defined by a sequence of portfolio vectors $\mathbf{B}_1, \ldots, \mathbf{B}_n$ such that the number of times
$t = 1, \ldots, n-1$ with $\mathbf{B}_t \ne \mathbf{B}_{t+1}$ is bounded by $k$. Construct an efficiently computable investment
strategy $P$ whose logarithmic wealth ratio satisfies

\[
\ln \frac{S_n(Q, \mathbf{x}^n)}{S_n(P, \mathbf{x}^n)} \le \frac{C}{c} \sqrt{\frac{n}{2} \left( (k+1) \ln m + (n-1)\, H\!\left( \frac{k}{n-1} \right) \right)}
\]

whenever the price relatives $x_{i,t}$ fall between the constants $c < C$. Hint: Combine the EG
investment strategy with the algorithm for tracking the best expert of Section 5.2.


10.11 (Internal regret for sequential investment) Given any investment strategy $Q$, one may define
its internal regret (in analogy to the internal regret of forecasting strategies; see Section 4.4),
for any $i, j \in \{1, \ldots, m\}$, by

\[
R_{(i,j),n} = \sum_{t=1}^{n} \ln \frac{\mathbf{Q}_t^{i \to j} \cdot \mathbf{x}_t}{\mathbf{Q}_t \cdot \mathbf{x}_t},
\]

where the modified portfolio $\mathbf{Q}_t^{i \to j}$ is defined such that its $i$th component equals 0, its $j$th
component equals $Q_{j,t} + Q_{i,t}$, and all other components are equal to those of $\mathbf{Q}_t$. Construct
an investment strategy such that if the price relatives are bounded between two constants $c$
and $C$, then $\max_{i,j} R_{(i,j),n} = O(n^{1/2})$ and, at the same time, the logarithmic wealth ratio with
respect to the class of all constantly rebalanced portfolios is also bounded by $O(n^{1/2})$ (Stoltz
and Lugosi [279]). Hint: First establish a linear upper bound as in Section 10.4 and then use
an internal regret minimizing forecaster from Section 4.4.


10.12 Prove Theorem 10.5 and Corollary 10.1. 


11


Linear Pattern Recognition


11.1 Prediction with Side Information 


We extend the protocol of prediction with expert advice by assuming that some side
information, represented by a real vector $\mathbf{x}_t \in \mathbb{R}^d$, is observed at the beginning of each
prediction round $t$. In this extended protocol we study experts and forecasters whose
predictions are based on linear functions of the side information.

Let the decision space $\mathcal{D}$ and the outcome space $\mathcal{Y}$ be a common subset of the real
line $\mathbb{R}$. Linear experts are experts indexed by vectors $\mathbf{u} \in \mathbb{R}^d$. In the sequel we identify
experts with this corresponding parameter, thus referring to a vector $\mathbf{u}$ as a linear expert.
The prediction $f_{\mathbf{u},t}$ of a linear expert $\mathbf{u}$ at time $t$ is a linear function of the side information:
$f_{\mathbf{u},t} = \mathbf{u} \cdot \mathbf{x}_t$. Likewise, the prediction $\hat{p}_t$ of the linear forecaster at time $t$ is $\hat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t$,
where the weight vector $\mathbf{w}_{t-1}$ is typically updated making use of the side information $\mathbf{x}_{t-1}$.

This prediction protocol can be naturally related to a sequential model of pattern recogni- 
tion: by viewing the components of the side-information vector as features of an underlying 
data element, we can use linear forecasters to solve pattern classification or regression 
problems, as described in Section 11.3 and subsequent sections. 

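As usual, we define the regret of a forecaster with respect to expert $\mathbf{u} \in \mathbb{R}^d$ by

\[
R_n(\mathbf{u}) = \hat{L}_n - L_n(\mathbf{u}) = \sum_{t=1}^{n} \bigl( \ell(\hat{p}_t, y_t) - \ell(\mathbf{u} \cdot \mathbf{x}_t, y_t) \bigr),
\]

where $\ell$ is a fixed loss function.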

In some applications we slightly depart from the linear prediction model by considering
forecasters and experts that, given the side information $\mathbf{x}$, predict with $\sigma(\mathbf{u} \cdot \mathbf{x})$, where
$\sigma : \mathbb{R} \to \mathbb{R}$ is a nonlinear transfer function. This transfer function, if chosen in conjunction
with a specific loss function, makes the proof of regret bounds easier.

In this chapter we derive bounds on the regret of forecasters using weights of the form
$\mathbf{w} = \nabla\Phi$, where $\Phi$ is a potential function. Unlike in the case of the weighted average
forecasters using the advice of finitely many experts, we do not define potentials over the
regret space (which is now a space of functions indexed by $\mathbb{R}^d$). Rather, we generalize the
approach of defining potentials over the gradient of the loss presented in Section 2.5 for
the exponential potential. To carry out this generalization we need some tools from convex
analysis that are described in the next section. In Section 11.3 we introduce the gradient-based
linear forecaster and prove a general bound for its regret. This bound is specialized to
the polynomial and exponential potentials in Section 11.4, where we analyze the gradient-based
forecaster using a nonlinear transfer function to control the norm of the loss gradient.




Section 11.5 introduces the projected forecaster, a gradient-based forecaster whose weights 
are kept in a given convex and closed region by means of repeated projections. This 
forecaster, when used with a polynomial potential, enjoys some remarkable properties. In 
particular, we show that projected forecasters are able to “track” the best linear expert and 
to dynamically tune their learning rate in a nearly optimal way. 

In the rest of the chapter we explore potentials that change over time. The regret bounds 
that we obtain for the square loss grow logarithmically with time, providing an exponential 
improvement on the bounds obtained using static potentials. In Section 11.9 we show that 
these bounds cannot be improved any further. Finally, in Section 11.10 we obtain similar 
improved regret bounds for the logarithmic loss. However, the forecaster that achieves such 
logarithmic regret bounds is different, because it is based on a mixture of experts, similar, 
in spirit, to the mixture forecasters studied in Chapter 9. 


11.2 Bregman Divergences 


In this section we make a digression to introduce Bregman divergences, a notion that plays 
a key role in the analysis of linear forecasters. 

Bregman divergences are a natural way of defining a notion of “distance” on the basis 
of an arbitrary convex function. To ensure that these divergences enjoy certain useful 
properties, the convex functions must obey some restrictions. 

We call Legendre any function $F : A \to \mathbb{R}$ such that


1. $A \subset \mathbb{R}^d$ is nonempty and its interior $\mathrm{int}(A)$ is convex;

2. $F$ is strictly convex with continuous first partial derivatives throughout $\mathrm{int}(A)$;

3. if $\mathbf{x}_1, \mathbf{x}_2, \ldots \in A$ is a sequence converging to a boundary point of $A$, then
$\|\nabla F(\mathbf{x}_n)\| \to \infty$ as $n \to \infty$.


The Bregman divergence induced by a Legendre function $F : A \to \mathbb{R}$ is the nonnegative
function $D_F : A \times \mathrm{int}(A) \to \mathbb{R}$ defined by

\[
D_F(\mathbf{u}, \mathbf{v}) = F(\mathbf{u}) - F(\mathbf{v}) - (\mathbf{u} - \mathbf{v}) \cdot \nabla F(\mathbf{v}).
\]

Hence, the Bregman divergence from $\mathbf{u}$ to $\mathbf{v}$ is simply the difference between $F(\mathbf{u})$ and
its linear approximation via the first-order Taylor expansion of $F$ around $\mathbf{v}$. Due to the
convexity of $F$, this difference is always nonnegative. Clearly, if $\mathbf{u} = \mathbf{v}$, then $D_F(\mathbf{u}, \mathbf{v}) = 0$.
Note also that the divergence is not symmetric in the arguments $\mathbf{u}$ and $\mathbf{v}$, so we will speak
of $D_F(\mathbf{u}, \mathbf{v})$ as the divergence from $\mathbf{u}$ to $\mathbf{v}$.


Example 11.1. The half of the squared euclidean distance $\frac{1}{2}\|\mathbf{u} - \mathbf{v}\|^2$ is the (symmetric)
Bregman divergence induced by the half of the squared euclidean norm $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$. In
this example we may take $A = \mathbb{R}^d$.


Example 11.2. The unnormalized Kullback-Leibler divergence

\[
D_F(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{d} p_i \ln \frac{p_i}{q_i} + \sum_{i=1}^{d} (q_i - p_i)
\]

is the Bregman divergence induced by the unnormalized negative entropy

\[
F(\mathbf{p}) = \sum_{i=1}^{d} p_i \ln p_i - \sum_{i=1}^{d} p_i
\]

defined on $A = (0, \infty)^d$.


Figure 11.1. This figure illustrates the generalized pythagorean inequality for squared euclidean
distances. The law of cosines states that $\|\mathbf{u} - \mathbf{w}\|^2$ is equal to $\|\mathbf{u} - \mathbf{w}'\|^2 + \|\mathbf{w}' - \mathbf{w}\|^2 -
2\, \|\mathbf{u} - \mathbf{w}'\|\, \|\mathbf{w}' - \mathbf{w}\| \cos\theta$, where $\theta$ is the angle at vertex $\mathbf{w}'$ of the triangle. If $\mathbf{w}'$ is the closest
point to $\mathbf{w}$ in the convex set $S$, then $\cos\theta \le 0$, implying that $\|\mathbf{u} - \mathbf{w}\|^2 \ge \|\mathbf{u} - \mathbf{w}'\|^2 + \|\mathbf{w}' - \mathbf{w}\|^2$.
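The definition of the Bregman divergence and the two examples above can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names (`bregman`, `half_sq_norm`, `neg_entropy`, `unnormalized_kl`) and the test points are illustrative assumptions.

```python
import math

def bregman(F, grad_F, u, v):
    """D_F(u, v) = F(u) - F(v) - (u - v) . grad F(v)."""
    g = grad_F(v)
    return F(u) - F(v) - sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))

# Example 11.1: half of the squared euclidean norm, A = R^d.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def grad_half_sq_norm(x):
    return list(x)

# Example 11.2: unnormalized negative entropy on A = (0, inf)^d.
def neg_entropy(p):
    return sum(pi * math.log(pi) - pi for pi in p)

def grad_neg_entropy(p):
    # d/dp (p ln p - p) = ln p
    return [math.log(pi) for pi in p]

def unnormalized_kl(p, q):
    """Closed form stated in Example 11.2."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

# Arbitrary positive test points (hypothetical values).
u = [0.2, 0.5, 1.3]
v = [0.7, 0.1, 0.4]

d_euclid = bregman(half_sq_norm, grad_half_sq_norm, u, v)
d_kl = bregman(neg_entropy, grad_neg_entropy, u, v)
d_kl_reverse = bregman(neg_entropy, grad_neg_entropy, v, u)
```

In this sketch `d_euclid` coincides with half the squared distance, `d_kl` coincides with the closed form of Example 11.2, and `d_kl` differs from `d_kl_reverse`, which illustrates the lack of symmetry.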


The following result, whose proof is left as an easy exercise, shows a basic relationship 
between the divergences of three arbitrary points. This relationship is used several times in 
subsequent sections. 


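Lemma 11.1. Let $F : A \to \mathbb{R}$ be Legendre. Then, for all $\mathbf{u} \in A$ and all $\mathbf{v}, \mathbf{w} \in \mathrm{int}(A)$,

\[
D_F(\mathbf{u}, \mathbf{v}) + D_F(\mathbf{v}, \mathbf{w}) = D_F(\mathbf{u}, \mathbf{w}) + (\mathbf{u} - \mathbf{v}) \cdot \bigl( \nabla F(\mathbf{w}) - \nabla F(\mathbf{v}) \bigr).
\]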
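The identity of Lemma 11.1 is easy to sanity-check numerically. The sketch below does so for the unnormalized negative entropy of Example 11.2; the function names and the three test points are assumptions chosen for illustration, and a numerical check is of course not a proof.

```python
import math

def neg_entropy(p):
    return sum(pi * math.log(pi) - pi for pi in p)

def grad_neg_entropy(p):
    return [math.log(pi) for pi in p]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def divergence(u, v):
    """Bregman divergence D_F(u, v) for F = unnormalized negative entropy."""
    diff = [ui - vi for ui, vi in zip(u, v)]
    return neg_entropy(u) - neg_entropy(v) - dot(diff, grad_neg_entropy(v))

# Three arbitrary points in (0, inf)^3 (hypothetical values).
u = [0.2, 0.5, 1.3]
v = [0.7, 0.1, 0.4]
w = [1.1, 0.9, 0.3]

lhs = divergence(u, v) + divergence(v, w)
grad_gap = [gw - gv for gw, gv in zip(grad_neg_entropy(w), grad_neg_entropy(v))]
rhs = divergence(u, w) + dot([ui - vi for ui, vi in zip(u, v)], grad_gap)
```

Both sides agree up to floating-point rounding.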


We now investigate the properties of projections based on Bregman divergences. Let
$F : A \to \mathbb{R}$ be a Legendre function and let $S \subset \mathbb{R}^d$ be a closed convex set with $S \cap A \ne \emptyset$.
The Bregman projection of $\mathbf{w} \in \mathrm{int}(A)$ onto $S$ is

\[
\operatorname*{argmin}_{\mathbf{u} \in S \cap A} D_F(\mathbf{u}, \mathbf{w}).
\]

The following lemma, whose proof is based on standard calculus, ensures existence and
uniqueness of projections.


Lemma 11.2. For all Legendre functions $F : A \to \mathbb{R}$, for all closed convex sets $S \subset \mathbb{R}^d$
such that $A \cap S \ne \emptyset$, and for all $\mathbf{w} \in \mathrm{int}(A)$, the Bregman projection of $\mathbf{w}$ onto $S$ exists
and is unique.


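With respect to projections, all Bregman divergences behave similarly to the squared
euclidean distance, as shown by the next result (see Figure 11.1).

Lemma 11.3 (Generalized pythagorean inequality). Let $F$ be a Legendre function.
For all $\mathbf{w} \in \mathrm{int}(A)$ and for all convex and closed sets $S \subset \mathbb{R}^d$ with $S \cap A \ne \emptyset$, if
$\mathbf{w}' = \operatorname{argmin}_{\mathbf{v} \in S \cap A} D_F(\mathbf{v}, \mathbf{w})$ then

\[
D_F(\mathbf{u}, \mathbf{w}) \ge D_F(\mathbf{u}, \mathbf{w}') + D_F(\mathbf{w}', \mathbf{w}) \qquad \text{for all } \mathbf{u} \in S.
\]

Proof. Define the function $G(\mathbf{x}) = D_F(\mathbf{x}, \mathbf{w}) - D_F(\mathbf{x}, \mathbf{w}')$. Expanding the divergences
and simplifying, we note that

\[
G(\mathbf{x}) = -F(\mathbf{w}) - (\mathbf{x} - \mathbf{w}) \cdot \nabla F(\mathbf{w}) + F(\mathbf{w}') + (\mathbf{x} - \mathbf{w}') \cdot \nabla F(\mathbf{w}').
\]

Thus $G$ is linear. Let $\mathbf{x}_\alpha = \alpha \mathbf{u} + (1 - \alpha) \mathbf{w}'$ be an arbitrary point on the line joining $\mathbf{u}$ and
$\mathbf{w}'$. By linearity, $G(\mathbf{x}_\alpha) = \alpha G(\mathbf{u}) + (1 - \alpha) G(\mathbf{w}')$ and thus, since $G(\mathbf{w}') = D_F(\mathbf{w}', \mathbf{w})$,

\[
D_F(\mathbf{x}_\alpha, \mathbf{w}) - D_F(\mathbf{x}_\alpha, \mathbf{w}')
= \alpha \bigl( D_F(\mathbf{u}, \mathbf{w}) - D_F(\mathbf{u}, \mathbf{w}') \bigr) + (1 - \alpha)\, D_F(\mathbf{w}', \mathbf{w}).
\]

For $\alpha > 0$, this leads to

\[
D_F(\mathbf{u}, \mathbf{w}) - D_F(\mathbf{u}, \mathbf{w}') - D_F(\mathbf{w}', \mathbf{w})
= \frac{D_F(\mathbf{x}_\alpha, \mathbf{w}) - D_F(\mathbf{x}_\alpha, \mathbf{w}') - D_F(\mathbf{w}', \mathbf{w})}{\alpha}
\ge \frac{-D_F(\mathbf{x}_\alpha, \mathbf{w}')}{\alpha},
\]

where we used $D_F(\mathbf{x}_\alpha, \mathbf{w}) \ge D_F(\mathbf{w}', \mathbf{w})$. This last inequality is true since $\mathbf{w}'$ is the point in
$S$ with smallest divergence to $\mathbf{w}$ and $\mathbf{x}_\alpha \in S$ since $\mathbf{u} \in S$ and $S$ is convex by hypothesis.
Let $D(\mathbf{x}) = D_F(\mathbf{x}, \mathbf{w}')$. Because the left-hand side above does not depend on $\alpha$, to prove the
lemma it is then enough to prove that $D(\mathbf{x}_\alpha)/\alpha \to 0$ as $\alpha \to 0^+$. Indeed, since $D(\mathbf{w}') = 0$,

\[
\lim_{\alpha \to 0^+} \frac{D(\mathbf{x}_\alpha)}{\alpha} = \lim_{\alpha \to 0^+} \frac{D\bigl(\mathbf{w}' + \alpha(\mathbf{u} - \mathbf{w}')\bigr) - D(\mathbf{w}')}{\alpha}.
\]

The last limit is the directional derivative $D'_{\mathbf{u} - \mathbf{w}'}(\mathbf{w}')$ of $D$ in the direction $\mathbf{u} - \mathbf{w}'$ evaluated at
$\mathbf{w}'$ (the directional derivative exists in $\mathrm{int}(A)$ because of the second condition in the definition
of Legendre functions; moreover, if $\mathbf{w} \in \mathrm{int}(A)$, then the third condition guarantees that $\mathbf{w}'$
does not belong to the boundary of $A$). Now, exploiting a well-known relationship between
the directional derivative of a function and its gradient, we find that

\[
D'_{\mathbf{u} - \mathbf{w}'}(\mathbf{w}') = (\mathbf{u} - \mathbf{w}') \cdot \nabla D(\mathbf{w}').
\]

Since $D$ is differentiable and nonnegative and $D(\mathbf{w}') = 0$, we have $\nabla D(\mathbf{w}') = \mathbf{0}$. This
completes the proof. ∎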


Note that Lemma 11.3 holds with equality whenever S is a hyperplane (see Exer- 
cise 11.2). 
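Lemma 11.3 and the remark on equality can be illustrated with two projections that have simple closed forms. The sketch below makes two assumptions that are not stated in the text: (i) the euclidean projection onto the unit ball rescales any point outside the ball to norm 1, and (ii) the Bregman projection onto the probability simplex under the unnormalized Kullback-Leibler divergence is plain normalization (this follows from a short Lagrangian calculation). All names and test points are illustrative.

```python
import math

def half_sq_dist(u, v):
    """Divergence of Example 11.1."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))

def unnormalized_kl(u, v):
    """Divergence of Example 11.2 (requires positive coordinates)."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(u, v))

def project_unit_ball(w):
    """Euclidean projection onto the closed unit ball."""
    norm = math.sqrt(sum(x * x for x in w))
    return list(w) if norm <= 1.0 else [x / norm for x in w]

def project_simplex_kl(w):
    """KL-type Bregman projection onto the simplex: normalize (assumption (ii))."""
    total = sum(w)
    return [x / total for x in w]

w = [2.0, 1.0, 0.5]

# Squared euclidean distance and a ball: the inequality is typically strict.
w_ball = project_unit_ball(w)
ball_points = [[0.0, 0.0, 0.0], [0.6, -0.3, 0.2], [-0.5, 0.5, 0.5], [1.0, 0.0, 0.0]]
ball_gaps = [half_sq_dist(u, w) - half_sq_dist(u, w_ball) - half_sq_dist(w_ball, w)
             for u in ball_points]

# Unnormalized KL and the simplex (a piece of a hyperplane): equality.
w_simplex = project_simplex_kl(w)
simplex_points = [[0.2, 0.3, 0.5], [0.7, 0.2, 0.1], [1 / 3, 1 / 3, 1 / 3]]
simplex_gaps = [unnormalized_kl(u, w) - unnormalized_kl(u, w_simplex)
                - unnormalized_kl(w_simplex, w) for u in simplex_points]
```

Each entry of `ball_gaps` is the slack in the generalized pythagorean inequality and is nonnegative; the entries of `simplex_gaps` vanish up to rounding, in line with the equality case for hyperplanes.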

To derive some additional key properties of Bregman divergences, we need a few basic
notions about convex duality. Let $F : A \to \mathbb{R}$ be Legendre. Then its Legendre dual (or
Legendre conjugate) is the function $F^*$ defined by

\[
F^*(\mathbf{u}) = \sup_{\mathbf{v} \in A} \bigl( \mathbf{u} \cdot \mathbf{v} - F(\mathbf{v}) \bigr).
\]

The conditions defining Legendre functions guarantee that whenever $F$ is Legendre,
then $F^* : A^* \to \mathbb{R}$ is also Legendre and such that $A^*$ is the range of the mapping
$\nabla F : \mathrm{int}(A) \to \mathbb{R}^d$. Moreover, the Legendre dual $F^{**}$ of $F^*$ equals $F$ (see, e.g., Section 26
in Rockafellar [247]). The following simple identity relates a pair of Legendre
duals.


Lemma 11.4. For all Legendre functions $F$, $F(\mathbf{u}) + F^*(\mathbf{u}') = \mathbf{u} \cdot \mathbf{u}'$ if and only if
$\mathbf{u}' = \nabla F(\mathbf{u})$.


Proof. By definition of Legendre duality, $F^*(\mathbf{u}')$ is the supremum of the concave function
$G(\mathbf{x}) = \mathbf{u}' \cdot \mathbf{x} - F(\mathbf{x})$. If this supremum is attained at $\mathbf{u} \in \mathbb{R}^d$, then $\nabla G(\mathbf{u}) = \mathbf{0}$, which is
to say $\mathbf{u}' = \nabla F(\mathbf{u})$. On the other hand, if $\mathbf{u}' = \nabla F(\mathbf{u})$, then $\mathbf{u}$ is a maximizer of $G(\mathbf{x})$, and
therefore $F^*(\mathbf{u}') = \mathbf{u} \cdot \mathbf{u}' - F(\mathbf{u})$. ∎


The following lemma shows that gradients of a pair of dual Legendre functions are 
inverses of each other. 


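Lemma 11.5. For all Legendre functions $F$, $\nabla F^* = (\nabla F)^{-1}$.


Proof. Using Lemma 11.4 twice,

\[
\begin{aligned}
\mathbf{u}' = \nabla F(\mathbf{u}) \quad &\text{if and only if} \quad F(\mathbf{u}) + F^*(\mathbf{u}') = \mathbf{u} \cdot \mathbf{u}', \\
\mathbf{u} = \nabla F^*(\mathbf{u}') \quad &\text{if and only if} \quad F^*(\mathbf{u}') + F^{**}(\mathbf{u}) = \mathbf{u} \cdot \mathbf{u}'.
\end{aligned}
\]

Because $F^{**} = F$, the lemma is proved. ∎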


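Example 11.3. The Legendre dual of the half of the squared $p$-norm $\frac{1}{2}\|\mathbf{u}\|_p^2$, $p \ge 2$, is
the half of the squared $q$-norm $\frac{1}{2}\|\mathbf{u}\|_q^2$, where $p$ and $q$ are conjugate exponents; that is,
$1/p + 1/q = 1$. The euclidean norm $\frac{1}{2}\|\mathbf{u}\|_2^2$ is the only self-dual norm (it is the dual of
itself). The squared $p$-norms are Legendre, and therefore the gradients of their duals are
inverses of each other,

\[
\left( \nabla \tfrac{1}{2}\|\mathbf{u}\|_p^2 \right)_i = \frac{\mathrm{sgn}(u_i)\, |u_i|^{p-1}}{\|\mathbf{u}\|_p^{p-2}}
\qquad \text{and} \qquad
\left( \nabla \tfrac{1}{2}\|\mathbf{u}\|_p^2 \right)^{-1} = \nabla \tfrac{1}{2}\|\mathbf{u}\|_q^2,
\]

where, as before, $1/p + 1/q = 1$.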
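The inverse relation between the gradients of the dual squared norms can be verified numerically. This is a minimal sketch under the gradient formula of Example 11.3; the function names, the choice $p = 3$, and the test vector are assumptions made for illustration, and the gradient at the origin is taken to be zero.

```python
import math

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_p_norm(u, p):
    """Gradient of 0.5 * ||u||_p^2: sgn(u_i) |u_i|^(p-1) / ||u||_p^(p-2)."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)          # gradient at the origin
    return [math.copysign(abs(x) ** (p - 1.0), x) / norm ** (p - 2.0) for x in u]

p = 3.0
q = p / (p - 1.0)                       # conjugate exponent, 1/p + 1/q = 1

u = [1.5, -0.4, 0.0, 2.2]               # hypothetical test vector
v = grad_half_sq_p_norm(u, p)           # map with the gradient of the p-norm potential
u_back = grad_half_sq_p_norm(v, q)      # map back with the gradient of the dual
```

Here `u_back` recovers `u` up to rounding. As a by-product of the same formula, the $q$-norm of `v` equals the $p$-norm of `u`.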


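Example 11.4. The function $F(\mathbf{u}) = e^{u_1} + \cdots + e^{u_d}$ has gradient $\nabla F(\mathbf{u})_i = e^{u_i}$ whose
inverse is $\nabla F^*(\mathbf{v})_i = \ln v_i$, $v_i > 0$. Hence, the Legendre dual of $F$ is

\[
F^*(\mathbf{v}) = \sum_{i=1}^{d} v_i (\ln v_i - 1).
\]

Note that if $\mathbf{v}$ lies on the probability simplex in $\mathbb{R}^d$ and $H(\mathbf{v})$ is the entropy of $\mathbf{v}$, then
$F^*(\mathbf{v}) = -(H(\mathbf{v}) + 1)$.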


Example 11.5. The hyperbolic cosine potential

\[
F(\mathbf{u}) = \frac{1}{2} \sum_{i=1}^{d} \left( e^{u_i} + e^{-u_i} \right)
\]

has gradient with components equal to the hyperbolic sine $\nabla F(\mathbf{u})_i = \sinh(u_i) = \frac{1}{2}(e^{u_i} - e^{-u_i})$.
Therefore, the inverse gradient is the hyperbolic arcsine, $\nabla F^*(\mathbf{v})_i = \mathrm{arcsinh}(v_i) =
\ln\bigl(\sqrt{v_i^2 + 1} + v_i\bigr)$, whose integral gives us the dual of $F$

\[
F^*(\mathbf{v}) = \sum_{i=1}^{d} \left( v_i\, \mathrm{arcsinh}(v_i) - \sqrt{v_i^2 + 1} \right).
\]


We close this section by mentioning an additional property that relates a divergence based 
on F to that based on its Legendre dual F*. 


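Proposition 11.1. Let $F : A \to \mathbb{R}$ be a Legendre function. For all $\mathbf{u}, \mathbf{v} \in \mathrm{int}(A)$, if
$\mathbf{u}' = \nabla F(\mathbf{u})$ and $\mathbf{v}' = \nabla F(\mathbf{v})$, then $D_F(\mathbf{u}, \mathbf{v}) = D_{F^*}(\mathbf{v}', \mathbf{u}')$.


Proof. We have

\[
\begin{aligned}
D_F(\mathbf{u}, \mathbf{v}) &= F(\mathbf{u}) - F(\mathbf{v}) - (\mathbf{u} - \mathbf{v}) \cdot \nabla F(\mathbf{v}) \\
&= F(\mathbf{u}) - F(\mathbf{v}) - (\mathbf{u} - \mathbf{v}) \cdot \mathbf{v}' \\
&= \mathbf{u}' \cdot \mathbf{u} - F^*(\mathbf{u}') - \mathbf{v}' \cdot \mathbf{v} + F^*(\mathbf{v}') - (\mathbf{u} - \mathbf{v}) \cdot \mathbf{v}'
   && \text{(using Lemma 11.4)} \\
&= \mathbf{u}' \cdot \mathbf{u} - F^*(\mathbf{u}') + F^*(\mathbf{v}') - \mathbf{u} \cdot \mathbf{v}' \\
&= F^*(\mathbf{v}') - F^*(\mathbf{u}') - (\mathbf{v}' - \mathbf{u}') \cdot \mathbf{u} \\
&= F^*(\mathbf{v}') - F^*(\mathbf{u}') - (\mathbf{v}' - \mathbf{u}') \cdot \nabla F^*(\mathbf{u}')
   && \text{(using } \mathbf{u}' = \nabla F(\mathbf{u}) \text{ and Lemma 11.5)} \\
&= D_{F^*}(\mathbf{v}', \mathbf{u}').
\end{aligned}
\]

This concludes the proof. ∎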
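Proposition 11.1 can be checked numerically for the dual pair of Example 11.4. The sketch below is illustrative only: the function names and the two test points are assumptions, and the check does not replace the proof above.

```python
import math

# Dual pair from Example 11.4.
def F(u):
    return sum(math.exp(x) for x in u)

def grad_F(u):
    return [math.exp(x) for x in u]

def F_star(v):
    return sum(x * (math.log(x) - 1.0) for x in v)

def grad_F_star(v):
    return [math.log(x) for x in v]

def bregman(G, grad_G, a, b):
    """D_G(a, b) = G(a) - G(b) - (a - b) . grad G(b)."""
    g = grad_G(b)
    return G(a) - G(b) - sum((ai - bi) * gi for ai, bi, gi in zip(a, b, g))

u = [0.3, -1.0, 2.0]      # hypothetical primal points
v = [1.1, 0.4, -0.5]
u_dual = grad_F(u)         # u' = grad F(u)
v_dual = grad_F(v)         # v' = grad F(v)

primal_divergence = bregman(F, grad_F, u, v)                   # D_F(u, v)
dual_divergence = bregman(F_star, grad_F_star, v_dual, u_dual)  # D_{F*}(v', u')
```

The two divergences agree up to rounding; note the swapped order of the arguments in the dual divergence.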


11.3 Potential-Based Gradient Descent 


Now we return to the main topic of this chapter introduced in Section 11.1, that is, to the
design and analysis of forecasters that use the side-information vector $\mathbf{x}_t$ and compete with
linear experts, or, in other words, with reference forecasters whose prediction takes the
form $f_{\mathbf{u},t} = \mathbf{u} \cdot \mathbf{x}_t$ for some fixed vector $\mathbf{u} \in \mathbb{R}^d$. In this and the four subsequent sections
we focus our attention on linear forecasters whose prediction, at time $t$, takes the form
$\hat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t$. The weight vector $\mathbf{w}_t$ used in the next round of prediction is determined
as a function of the current weight $\mathbf{w}_{t-1}$, the side information $\mathbf{x}_t$, and the outcome $y_t$. The
forecasters studied in these sections differ in the way the weight vectors are updated. We
start by describing a family of forecasters that update their weights by performing a kind
of gradient descent based on an appropriately defined potential function.

To motivate the gradient descent forecasters defined below, we compare them first to
the potential-based forecasters introduced in Chapter 2 in a different setup. Recall that
in Chapter 2 we define weighted average forecasters using potential functions in order to
control the dependence of the weights on the regret. More precisely, we define weights at
time $t$ by $\mathbf{w}_{t-1} = \nabla\Phi(\mathbf{R}_{t-1})$, where $\mathbf{R}_{t-1}$ is the cumulative regret up to time $t - 1$. The
convexity of the loss function, through which the regret is defined, entails (via Lemma 2.1)
an invariant that we call the Blackwell condition: $\mathbf{w}_{t-1} \cdot \mathbf{r}_t \le 0$. This invariant, together
with Taylor's theorem, is the main tool used in Theorem 2.1.

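If $\Phi$ is a Legendre function, then Lemma 11.5 provides the dual relations $\mathbf{w}_t = \nabla\Phi(\mathbf{R}_t)$
and $\mathbf{R}_t = \nabla\Phi^*(\mathbf{w}_t)$. As $\Phi$ and $\Phi^*$ are Legendre duals, we may call primal weights the
regrets $\mathbf{R}_t$ and dual weights the weights $\mathbf{w}_t$, where $\nabla\Phi$ maps the primal weights to the dual
weights and $\nabla\Phi^*$ performs the inverse mapping. Introducing the notation $\boldsymbol{\theta}_t = \mathbf{R}_t$ to stress
the fact that we now view regrets as parameters, we see that $\boldsymbol{\theta}_t$ satisfies the recursion

\[
\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} + \mathbf{r}_t \qquad \text{(primal regret update).}
\]

Via the identities $\boldsymbol{\theta}_t = \mathbf{R}_t = \nabla\Phi^*(\mathbf{w}_t)$, the primal regret update can be also rewritten in
the equivalent dual form

\[
\nabla\Phi^*(\mathbf{w}_t) = \nabla\Phi^*(\mathbf{w}_{t-1}) + \mathbf{r}_t \qquad \text{(dual regret update).}
\]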


To appreciate the power of this dual interpretation for the regret-based update, consider
the following argument. The direct application of the potential-based forecaster to the class
of linear experts requires a quantization (discretization) of the linear coefficient domain $\mathbb{R}^d$
in order to obtain a finite approximation of the set of experts. Performing this quantization
in, say, a bounded region $[-W, W]^d$ of $\mathbb{R}^d$ results in a number of experts of the order of
$(W/\varepsilon)^d$, where $\varepsilon$ is the quantization scale. This inconvenient exponential dependence on
the dimension $d$ can be avoided altogether by replacing the regret minimization approach
of Chapter 2 with a different loss minimization method. This method, which we call
sequential gradient descent, is applicable to linear forecasters generating predictions of the
form $\hat{p}_t = \boldsymbol{\theta}_{t-1} \cdot \mathbf{x}_t$ and uses the weight update rule

\[
\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \lambda \nabla \ell_t(\boldsymbol{\theta}_{t-1}) \qquad \text{(primal gradient update),}
\]

where $\boldsymbol{\theta}_t \in \mathbb{R}^d$, $\lambda > 0$ is an arbitrary scaling factor, and we set $\ell_t(\boldsymbol{\theta}_{t-1}) = \ell(\boldsymbol{\theta}_{t-1} \cdot \mathbf{x}_t, y_t)$.
With this method we replace regret minimization taking place in $\mathbb{R}^N$ (where the index set $N$ is $\mathbb{R}^d$ in this
case) with gradient minimization taking place in $\mathbb{R}^d$. Note also that, due to the convexity
of the loss functions $\ell(\cdot, y)$, minimizing the gradient implies minimizing the loss.

In full analogy with the regret minimization approach, we may now introduce a potential
$\Phi$, the associated dual weights $\mathbf{w}_t = \nabla\Phi(\boldsymbol{\theta}_t)$, and the forecaster

\[
\hat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t \qquad \text{(gradient-based linear forecaster)}
\]

whose weights $\mathbf{w}_{t-1}$ are updated using the rule

\[
\nabla\Phi^*(\mathbf{w}_t) = \nabla\Phi^*(\mathbf{w}_{t-1}) - \lambda \nabla \ell_t(\mathbf{w}_{t-1}) \qquad \text{(dual gradient update).}
\]

By rewriting the dual gradient update as

\[
\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \lambda \nabla \ell_t(\mathbf{w}_{t-1})
\]

we see that this update corresponds to performing a gradient descent step on the weights
$\boldsymbol{\theta}_t$, which are the image of $\mathbf{w}_t$ according to the bijection $\nabla\Phi^*$, using $\nabla \ell_t(\mathbf{w}_{t-1})$ rather than
$\nabla \ell_t(\boldsymbol{\theta}_{t-1})$ (see Figure 11.2).

Figure 11.2. An illustration of the dual gradient update. A weight $\mathbf{w}_{t-1} \in \mathbb{R}^2$ is updated as follows:
first, $\mathbf{w}_{t-1}$ is mapped to the corresponding primal weight $\boldsymbol{\theta}_{t-1}$ via the bijection $\nabla\Phi^*$. Then a gradient
descent step is taken obtaining the updated primal weight $\boldsymbol{\theta}_t$. Finally, $\boldsymbol{\theta}_t$ is mapped to the dual weight
$\mathbf{w}_t$ via the inverse mapping $\nabla\Phi$. The curve on the left-hand side shows a surface of constant loss.
The curve on the right-hand side is the image of this surface according to the mapping $\nabla\Phi$, where $\Phi$
is the polynomial potential with degree $p = 2.6$.


Since $\nabla\Phi$ is the inverse of $\nabla\Phi^*$, it is easy to express explicitly the update in terms of
$\mathbf{w}_t$:

\[
\mathbf{w}_t = \nabla\Phi\bigl( \nabla\Phi^*(\mathbf{w}_{t-1}) - \lambda \nabla \ell_t(\mathbf{w}_{t-1}) \bigr).
\]
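A single dual gradient update can be sketched as three steps: map the dual weight to the primal side, take a gradient step, and map back. The code below is an illustration under stated assumptions, not the book's implementation: the function names are hypothetical, the loss is the halved square loss, and the "exponential" instance uses the function $F$ of Example 11.4 as potential (so that $\nabla\Phi = \exp$ and $\nabla\Phi^* = \ln$ coordinatewise), which yields an unnormalized multiplicative update.

```python
import math

def dual_gradient_step(w, grad_loss, lam, grad_phi, grad_phi_star):
    """One update w_t = grad_phi(grad_phi_star(w_{t-1}) - lam * grad_loss)."""
    theta = grad_phi_star(w)                                    # dual -> primal
    theta = [th - lam * g for th, g in zip(theta, grad_loss)]   # gradient step
    return grad_phi(theta)                                      # primal -> dual

def identity(v):
    # Quadratic potential 0.5 * ||.||^2: both gradient maps are the identity.
    return list(v)

def exp_map(theta):
    return [math.exp(t) for t in theta]

def log_map(w):
    return [math.log(x) for x in w]

def square_loss_gradient(w, x, y):
    """Gradient in w of 0.5 * (w . x - y)^2."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

# Hypothetical side information, outcome, and learning rate.
x, y, lam = [1.0, -2.0, 0.5], 1.5, 0.1
w_prev = [1.0, 1.0, 1.0]
g = square_loss_gradient(w_prev, x, y)

w_quadratic = dual_gradient_step(w_prev, g, lam, identity, identity)
w_exponential = dual_gradient_step(w_prev, g, lam, exp_map, log_map)
```

With the quadratic potential the step is ordinary gradient descent, `w_prev - lam * g`. With the exponential-type potential each coordinate is multiplied by `exp(-lam * g[i])`, so the weights stay positive.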


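A different intuition on the gradient-based linear forecaster is gained by observing that $\mathbf{w}_t$
may be viewed as an approximate solution to

\[
\min_{\mathbf{u} \in \mathbb{R}^d} \bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda\, \ell_t(\mathbf{u}) \bigr].
\]

In other words, $\mathbf{w}_t$ expresses a tradeoff between the distance from the old weight $\mathbf{w}_{t-1}$
(measured by the Bregman divergence induced by the dual potential $\Phi^*$) and the loss
suffered by $\mathbf{w}_t$ if the last observed pair $(\mathbf{x}_t, y_t)$ appeared again at the next time step. To
conclude the discussion on $\mathbf{w}_t$, note that $\mathbf{w}_t$ is in fact characterized as a solution to the
following convex minimization problem:

\[
\min_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda \bigl( \ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla \ell_t(\mathbf{w}_{t-1}) \bigr) \Bigr].
\]

This second minimization problem is an approximated version of the first one because
the term $\ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla \ell_t(\mathbf{w}_{t-1})$ is the first-order Taylor approximation of $\ell_t(\mathbf{u})$
around $\mathbf{w}_{t-1}$.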

We call regular loss function any convex, differentiable, nonnegative function $\ell : \mathbb{R} \times
\mathbb{R} \to \mathbb{R}$ such that, for any fixed $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t \in \mathbb{R}$, the function $\ell_t(\mathbf{w}) = \ell(\mathbf{w} \cdot \mathbf{x}_t, y_t)$ is
differentiable. The next result shows a general bound on the regret of the gradient-based
linear forecaster.


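Theorem 11.1. Let $\ell$ be a regular loss function. If the gradient-based linear forecaster
is run with a Legendre potential $\Phi$, then, for all $\mathbf{u} \in \mathbb{R}^d$, the regret $R_n(\mathbf{u}) = \hat{L}_n - L_n(\mathbf{u})$
satisfies

\[
R_n(\mathbf{u}) \le \frac{1}{\lambda} D_{\Phi^*}(\mathbf{u}, \mathbf{w}_0) + \frac{1}{\lambda} \sum_{t=1}^{n} D_{\Phi^*}(\mathbf{w}_{t-1}, \mathbf{w}_t).
\]

Proof. Consider any linear forecaster using fixed weights $\mathbf{u} \in \mathbb{R}^d$. Then

\[
\begin{aligned}
\ell_t(\mathbf{w}_{t-1}) &\le \ell_t(\mathbf{u}) - (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla \ell_t(\mathbf{w}_{t-1})
  && \text{(by Taylor's theorem and using convexity of } \ell \text{)} \\
&= \ell_t(\mathbf{u}) + \frac{1}{\lambda} (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \bigl( \nabla\Phi^*(\mathbf{w}_t) - \nabla\Phi^*(\mathbf{w}_{t-1}) \bigr)
  && \text{(by definition of the dual gradient update)} \\
&= \ell_t(\mathbf{u}) + \frac{1}{\lambda} \bigl( D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi^*}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) \bigr)
  && \text{(by Lemma 11.1).}
\end{aligned}
\]

Summing over $t$ and using the nonnegativity of Bregman divergences to drop the
term $-D_{\Phi^*}(\mathbf{u}, \mathbf{w}_n)$ completes the proof. ∎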
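For the quadratic potential, where $\Phi^* = \Phi = \frac{1}{2}\|\cdot\|^2$ and the dual gradient update is plain gradient descent, the bound of Theorem 11.1 can be evaluated directly on data. The following sketch is illustrative only: the data stream, the learning rate, the comparison vectors, and all names are assumptions, and the loss is the halved square loss.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def half_sq_dist(u, v):
    """Bregman divergence induced by 0.5 * ||.||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))

def loss(p, y):
    return 0.5 * (p - y) ** 2

# Hypothetical deterministic data stream (not from the text).
n, lam = 300, 0.05
xs = [[math.cos(t), math.sin(2.0 * t), 1.0] for t in range(1, n + 1)]
ys = [0.5 * x[0] - 0.3 * x[1] + 0.1 * math.sin(5.0 * t)
      for t, x in enumerate(xs, start=1)]

w = [0.0, 0.0, 0.0]          # w_0 = grad Phi(0) = 0 for the quadratic potential
forecaster_loss = 0.0
divergence_sum = 0.0
for x, y in zip(xs, ys):
    p = dot(w, x)
    forecaster_loss += loss(p, y)
    # Gradient of the loss in w is (p - y) * x; quadratic potential => plain step.
    w_new = [wi - lam * (p - y) * xi for wi, xi in zip(w, x)]
    divergence_sum += half_sq_dist(w, w_new)   # D(w_{t-1}, w_t)
    w = w_new

def regret_and_bound(u):
    expert_loss = sum(loss(dot(u, x), y) for x, y in zip(xs, ys))
    regret = forecaster_loss - expert_loss
    bound = (half_sq_dist(u, [0.0] * len(u)) + divergence_sum) / lam
    return regret, bound

experts = [[0.5, -0.3, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, -1.0]]
results = [regret_and_bound(u) for u in experts]
```

For every comparison vector in `experts`, the computed regret does not exceed the right-hand side of the theorem.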


At first glance, the bound of Theorem 11.1 may give the impression that the larger
the $\lambda$, the smaller the regret is. However, a large value of $\lambda$ may make the weight vectors
$\mathbf{w}_t$ change rapidly with time, causing an increase in the divergences in the second term. In
the concrete examples that follow, we will see that the learning rate $\lambda$ needs to be tuned
carefully to obtain optimal performance.

It is interesting to compare the bound of Theorem 11.1 (for $\lambda = 1$) with the bound of
Theorem 2.1. Rewriting both bounds in primal weight form, we obtain, setting $\mathbf{w}_0 = \nabla\Phi(\mathbf{0})$,

\[
\Phi(\mathbf{R}_n) \le \Phi(\mathbf{0}) + \sum_{t=1}^{n} D_\Phi(\boldsymbol{\theta}_t, \boldsymbol{\theta}_{t-1})
\]

\[
R_n(\mathbf{u}) \le D_\Phi\bigl(\mathbf{0}, \nabla\Phi^*(\mathbf{u})\bigr) + \sum_{t=1}^{n} D_\Phi(\boldsymbol{\theta}_t, \boldsymbol{\theta}_{t-1}),
\]

where, in the first bound, the $\boldsymbol{\theta}_t$ are updated using the primal regret update (as in Theorem 2.1) and,
in the second bound, the $\boldsymbol{\theta}_t$ are updated using the primal gradient update (as in Theorem 11.1). Note that, in both
cases, the terms in the sum on the right-hand side are the divergences from the old primal
weight to the new primal weight. However, whereas the first bound applies to a set of $N$
arbitrary experts, the second bound applies to the set of all (continuously many) linear
experts.


11.4 The Transfer Function 


To add flexibility to the gradient-based linear forecaster, and to make its analysis easier,
we constrain its predictions $\hat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t$ by introducing a differentiable and nondecreasing
transfer function $\sigma : \mathbb{R} \to \mathbb{R}$ and letting $\hat{p}_t = \sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t)$. Similarly, we redefine the
predictions of expert $\mathbf{u}$ as $\sigma(\mathbf{u} \cdot \mathbf{x}_t)$, and let $\ell_t^\sigma(\mathbf{w}_{t-1}) = \ell(\sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t), y_t)$. The forecaster
predicting with a transfer function, which we simply call gradient-based forecaster, is
sketched in the following:


THE GRADIENT-BASED FORECASTER
Parameters: learning rate $\lambda > 0$, transfer function $\sigma$.
Initialization: $\mathbf{w}_0 = \nabla\Phi(\mathbf{0})$.
For each round $t = 1, 2, \ldots$


(1) observe $\mathbf{x}_t$ and predict $\hat{p}_t = \sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t)$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t^\sigma(\mathbf{w}_{t-1}) = \ell(\hat{p}_t, y_t)$;
(3) let $\mathbf{w}_t = \nabla\Phi\bigl( \nabla\Phi^*(\mathbf{w}_{t-1}) - \lambda \nabla \ell_t^\sigma(\mathbf{w}_{t-1}) \bigr)$.


The regret of the gradient-based forecaster with respect to a linear expert $\mathbf{u}$ takes the form

\[
R_n^\sigma(\mathbf{u}) = \sum_{t=1}^{n} \bigl( \ell_t^\sigma(\mathbf{w}_{t-1}) - \ell_t^\sigma(\mathbf{u}) \bigr).
\]

Note that, for fairness, the loss of the forecaster $\hat{p}_t = \sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t)$ is compared with the
loss of the "transferred" linear expert $\sigma(\mathbf{u} \cdot \mathbf{x}_t)$. Instances of the gradient-based forecaster
obtained by considering specific potentials correspond to well-known pattern recognition
algorithms. For example, the Widrow-Hoff rule [310] $\mathbf{w}_t = \mathbf{w}_{t-1} - \lambda (\mathbf{w}_{t-1} \cdot \mathbf{x}_t - y_t)\, \mathbf{x}_t$ is
equivalent to the gradient-based forecaster using the quadratic potential (i.e., the polynomial
potential with $p = 2$), the square loss, and the identity transfer function $\sigma(p) = p$. The EG
algorithm of Kivinen and Warmuth [181] corresponds to the forecaster using the exponential
potential. Regret bounds for these concrete potentials are derived later in this section.
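The three steps of the gradient-based forecaster translate almost line by line into code. The sketch below is one possible rendering under explicit assumptions: the potential is the halved squared $p$-norm of Example 11.3 (so that $\nabla\Phi^*$ is the same map with the conjugate exponent $q$), the caller supplies the derivative of $\ell(\sigma(v), y)$ with respect to $v$, and the class name, method names, and data are hypothetical. With $p = 2$, the identity transfer function, and the halved square loss, it reproduces the Widrow-Hoff recursion.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def grad_half_sq_p_norm(v, p):
    """Gradient of 0.5 * ||v||_p^2 (taken to be zero at the origin)."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(x) ** (p - 1.0), x) / norm ** (p - 2.0) for x in v]

class GradientBasedForecaster:
    def __init__(self, dim, p, lam, sigma, dloss_dv):
        self.p, self.q = p, p / (p - 1.0)       # conjugate exponents
        self.lam, self.sigma, self.dloss_dv = lam, sigma, dloss_dv
        self.w = grad_half_sq_p_norm([0.0] * dim, p)   # w_0 = grad Phi(0)

    def predict(self, x):
        # Step (1): transferred linear prediction.
        return self.sigma(dot(self.w, x))

    def update(self, x, y):
        # Step (3): the loss gradient in w is dloss_dv(w . x, y) * x (chain rule).
        slope = self.dloss_dv(dot(self.w, x), y)
        theta = grad_half_sq_p_norm(self.w, self.q)               # grad Phi*(w)
        theta = [th - self.lam * slope * xi for th, xi in zip(theta, x)]
        self.w = grad_half_sq_p_norm(theta, self.p)               # grad Phi(theta)

def widrow_hoff(xs, ys, lam):
    w = [0.0] * len(xs[0])
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        w = [wi - lam * residual * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical data; identity transfer and halved square loss, slope = v - y.
xs = [[math.cos(t), math.sin(2.0 * t), 1.0] for t in range(1, 51)]
ys = [0.5 * x[0] - 0.3 * x[1] for x in xs]

quadratic = GradientBasedForecaster(3, 2.0, 0.1, lambda v: v, lambda v, y: v - y)
cubic = GradientBasedForecaster(3, 3.0, 0.1, lambda v: v, lambda v, y: v - y)
for x, y in zip(xs, ys):
    for forecaster in (quadratic, cubic):
        forecaster.predict(x)      # step (1); step (2) would record the loss
        forecaster.update(x, y)
w_reference = widrow_hoff(xs, ys, 0.1)
```

Keeping the dual weight $\mathbf{w}_t$ as the state and mapping through $\nabla\Phi^*$ and $\nabla\Phi$ at each round mirrors step (3) literally; storing the primal weight $\boldsymbol{\theta}_t$ instead would be an equivalent design.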

To apply Theorem 11.1 with transfer functions, we have to make sure that the loss
$\ell(\sigma(\mathbf{w} \cdot \mathbf{x}), y)$ is a convex function of $\mathbf{w}$ for all $y$. Because $\mathbf{w} \cdot \mathbf{x}$ is linear, it suffices
to guarantee that $\ell(\sigma(v), y)$ is convex in $v$. We call nice pair any pair $(\sigma, \ell)$ such that
$\ell(\sigma(\cdot), y)$ is convex for all fixed $y$. Trivially, any regular loss function forms a nice pair
with the identity transfer function. Less trivial examples are as follows.


Example 11.6. The hyperbolic tangent $\sigma(v) = (e^{v} - e^{-v})/(e^{v} + e^{-v}) \in [-1, 1]$ and the
entropic loss

\[
\ell(p, y) = \frac{1 + y}{2} \ln \frac{1 + y}{1 + p} + \frac{1 - y}{2} \ln \frac{1 - y}{1 - p}
\]

form a nice pair.


Example 11.7. The logistic transfer function $\sigma(v) = (1 + e^{-v})^{-1} \in [0, 1]$ and the Hellinger
loss $\ell(p, y) = \bigl(\sqrt{p} - \sqrt{y}\bigr)^2 + \bigl(\sqrt{1 - p} - \sqrt{1 - y}\bigr)^2$ form a nice pair.


To avoid imposing artificial conditions on the sequence of outcomes and side information,
we focus on nice pairs $(\sigma, \ell)$ satisfying the additional condition

\[
\left( \frac{d\, \ell(\sigma(v), y)}{dv} \right)^2 \le \alpha\, \ell(\sigma(v), y) \qquad \text{for some } \alpha > 0.
\]

We call such pairs $\alpha$-subquadratic. The nice pair of Example 11.6 is 2-subquadratic and
the nice pair of Example 11.7 is 1/4-subquadratic. The nice pair formed by the identity
transfer function and the square loss $\ell(p, y) = \frac{1}{2}(p - y)^2$ is 2-subquadratic, since in this case
the squared derivative $(v - y)^2$ equals $2\, \ell(v, y)$. (To analyze the
gradient-based forecaster, it is more convenient to work with this definition of square loss
equal to half of the square loss used in previous chapters.) We now illustrate the regret bounds
that we can obtain through the use of $\alpha$-subquadratic nice pairs. Let $R_n^\sigma(\mathbf{u}) = \hat{L}_n^\sigma - L_n^\sigma(\mathbf{u})$,
where

\[
\hat{L}_n^\sigma = \sum_{t=1}^{n} \ell\bigl(\sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t), y_t\bigr)
\qquad \text{and} \qquad
L_n^\sigma(\mathbf{u}) = \sum_{t=1}^{n} \ell\bigl(\sigma(\mathbf{u} \cdot \mathbf{x}_t), y_t\bigr).
\]

Polynomial Potential 

The polynomial potential $\|\mathbf{u}_+\|_p^2$ defined in Chapter 2 is not Legendre because, owing to the
$(\cdot)_+$ operator, it is not strictly convex outside of the positive orthant. To make this potential
Legendre, we may redefine it as $\Phi_p(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|_p^2$, where we also introduced an additional
scaling factor of 1/2 so as to have $\Phi_p^*(\mathbf{u}) = \Phi_q(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|_q^2$ (see Example 11.3). It is easy
to check that all the regret bounds we proved in Chapter 2 for the polynomial potential still
hold for the Legendre polynomial potential.


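Theorem 11.2. For any $\alpha$-subquadratic nice pair $(\sigma, \ell)$, if the gradient-based
forecaster using the Legendre polynomial potential $\Phi_p$ is run with learning rate
$\lambda = 2\varepsilon/((p-1)\,\alpha\, X_p^2)$ on a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, where $0 < \varepsilon < 1$, then
for all $\mathbf{u} \in \mathbb{R}^d$ and for all $n \ge 1$ such that $\max_{t=1,\ldots,n} \|\mathbf{x}_t\|_p \le X_p$,

$$\widehat{L}_n^{\sigma} \le \frac{L_n^{\sigma}(\mathbf{u})}{1-\varepsilon} + \frac{\|\mathbf{u}\|_q^2}{\varepsilon(1-\varepsilon)} \times \frac{(p-1)\,\alpha\, X_p^2}{4},$$

where $q$ is the conjugate exponent of $p$.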


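Proof. We apply Theorem 11.1 to the Legendre potential $\Phi_p$ and to the convex loss
$\ell_t^{\sigma}(\cdot)$ and obtain

$$R_n^{\sigma}(\mathbf{u}) \le \frac{1}{\lambda} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0) + \frac{1}{\lambda} \sum_{t=1}^{n} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}_t).$$

Since the initial primal weight $\boldsymbol{\theta}_0$ is $\mathbf{0}$, $\mathbf{w}_0 = \nabla\Phi_p(\mathbf{0}) = \mathbf{0}$ and $\Phi_q(\mathbf{w}_0) = 0$. This implies
$D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0) = \Phi_q(\mathbf{u})$. As for the other terms, we simply observe that, by Proposition 11.1,
$D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}_t) = D_{\Phi_p}(\boldsymbol{\theta}_t, \boldsymbol{\theta}_{t-1})$, where $\boldsymbol{\theta}_t = \nabla\Phi_q(\mathbf{w}_t)$ and
$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \lambda \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})$. We can then adapt the proof of Corollary 2.1 replacing $N$ by $d$
and $\mathbf{r}_t$ by $-\lambda \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})$ and adjusting for the scaling factor 1/2. This yields

$$\begin{aligned}
R_n^{\sigma}(\mathbf{u})
&\le \frac{\Phi_q(\mathbf{u})}{\lambda} + \frac{1}{\lambda} \sum_{t=1}^{n} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}_t) \\
&\le \frac{\Phi_q(\mathbf{u})}{\lambda} + (p-1)\frac{\lambda}{2} \sum_{t=1}^{n} \bigl\|\nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})\bigr\|_p^2 \\
&\le \frac{\Phi_q(\mathbf{u})}{\lambda} + (p-1)\frac{\lambda}{2} \sum_{t=1}^{n} \left( \left. \frac{d\,\ell(\sigma(v), y_t)}{dv} \right|_{v = \mathbf{w}_{t-1} \cdot \mathbf{x}_t} \right)^{2} \|\mathbf{x}_t\|_p^2 \\
&\le \frac{\Phi_q(\mathbf{u})}{\lambda} + (p-1)\frac{\lambda}{2} \sum_{t=1}^{n} \alpha\, \ell_t^{\sigma}(\mathbf{w}_{t-1})\, \|\mathbf{x}_t\|_p^2 \qquad \text{(as $(\sigma, \ell)$ is $\alpha$-subquadratic)} \\
&\le \frac{\Phi_q(\mathbf{u})}{\lambda} + (p-1)\frac{\alpha\lambda}{2}\, X_p^2\, \widehat{L}_n^{\sigma}.
\end{aligned}$$

Note that $R_n^{\sigma}(\mathbf{u}) = \widehat{L}_n^{\sigma} - L_n^{\sigma}(\mathbf{u})$. Rearranging terms, substituting our choice of $\lambda$, and using
the equality $\Phi_q(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|_q^2$ yields the result. □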
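To make the statement of Theorem 11.2 concrete, here is a minimal sketch of the gradient-based forecaster with the Legendre polynomial potential, run with the tanh transfer function and the entropic loss of Example 11.6 (so $\alpha = 2$). The synthetic data, the choice $p = 3$, the comparator `u`, and all function names are assumptions made for illustration only.

```python
import math
import random

def grad_phi(theta, p):
    """Gradient of the Legendre polynomial potential 0.5 * ||theta||_p^2."""
    norm = sum(abs(t) ** p for t in theta) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(theta)
    return [math.copysign(abs(t) ** (p - 1), t) / norm ** (p - 2) for t in theta]

def entropic_loss(p_hat, y):
    return ((1 + y) / 2) * math.log((1 + y) / (1 + p_hat)) \
        + ((1 - y) / 2) * math.log((1 - y) / (1 - p_hat))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def run_forecaster(data, p, alpha, eps):
    """Return (cumulative loss of the forecaster, X_p) on the given sequence."""
    d = len(data[0][0])
    x_p = max(sum(abs(c) ** p for c in x) ** (1.0 / p) for x, _ in data)
    lam = 2 * eps / ((p - 1) * alpha * x_p ** 2)
    theta = [0.0] * d                       # primal weight, theta_0 = 0
    total = 0.0
    for x, y in data:
        w = grad_phi(theta, p)              # dual weight w_{t-1}
        p_hat = math.tanh(dot(w, x))
        total += entropic_loss(p_hat, y)
        # For the tanh / entropic pair, d loss / dv = p_hat - y.
        theta = [ti - lam * (p_hat - y) * xi for ti, xi in zip(theta, x)]
    return total, x_p

random.seed(0)
d, p, alpha, eps = 5, 3.0, 2.0, 0.5
u = [1.0, 0.0, -0.5, 0.0, 0.25]             # hypothetical linear expert
data = []
for _ in range(200):
    x = [random.choice((-1.0, 1.0)) for _ in range(d)]
    data.append((x, 0.9 * math.tanh(dot(u, x))))

forecaster_loss, x_p = run_forecaster(data, p, alpha, eps)
expert_loss = sum(entropic_loss(math.tanh(dot(u, x)), y) for x, y in data)
q = p / (p - 1)
u_q_sq = sum(abs(ui) ** q for ui in u) ** (2.0 / q)
bound = expert_loss / (1 - eps) \
    + u_q_sq * (p - 1) * alpha * x_p ** 2 / (4 * eps * (1 - eps))
```

Here `bound` is the right-hand side of the inequality in Theorem 11.2 for this particular `u`, and `forecaster_loss` should not exceed it.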


Choosing, say, $\varepsilon = 1/2$, the bound of Theorem 11.2 guarantees that the cumulative loss
of the gradient-based forecaster is not larger than twice the loss of any linear expert plus
a constant depending on the expert. If $\varepsilon$ is chosen to be a smaller constant, the factor of 2
in front of $L_n^{\sigma}(\mathbf{u})$ may be decreased to any value greater than 1, at the price of increasing
the constant terms. One may be tempted to choose the value of the tuning parameter $\varepsilon$ so
as to minimize the obtained upper bound. However, this tuned $\varepsilon$ would have to depend on
the preliminary knowledge of $L_n^{\sigma}(\mathbf{u})$. Since the bound holds for all $\mathbf{u}$, and $\mathbf{u}$ is arbitrary,
such optimization is not feasible. In Section 11.5 we introduce a so-called self-confident
forecaster that dynamically tunes the value of the learning rate $\lambda$ to achieve a bound of the
order $\sqrt{L_n^{\sigma}(\mathbf{u})}$ for all $\mathbf{u}$ for which $\Phi_q(\mathbf{u})$ is bounded by a constant. Note that this bound
behaves as if $\lambda$ had been optimized in the bound of Theorem 11.2 separately for each $\mathbf{u}$ in
the set considered.


Exponential Potential 
As for the polynomial potential, the exponential potential defined as

$$\frac{1}{\eta} \ln \sum_{i=1}^{d} e^{\eta u_i}$$

is not Legendre (in particular, the gradient is constant along the line $u_1 = u_2 = \cdots = u_d$,
and therefore the potential is not strictly convex). We thus introduce the Legendre
exponential potential $\Phi(\mathbf{u}) = e^{u_1} + \cdots + e^{u_d}$ (the parameter $\eta$, used for the exponential
potential in Chapter 2, is redundant here). Recalling Example 11.4, $\nabla\Phi^*(\mathbf{w})_i = \ln w_i$. For
this potential, the dual gradient update $\mathbf{w}_t = \nabla\Phi\bigl(\nabla\Phi^*(\mathbf{w}_{t-1}) - \lambda \nabla\ell_t(\mathbf{w}_{t-1})\bigr)$ can thus be
written as

$$w_{i,t} = \exp\bigl(\ln w_{i,t-1} - \lambda \nabla\ell_t(\mathbf{w}_{t-1})_i\bigr) = w_{i,t-1}\, e^{-\lambda \nabla\ell_t(\mathbf{w}_{t-1})_i} \qquad \text{for } i = 1, \ldots, d.$$

Adding normalization, which we need for the analysis of the regret, results in the final
weight update rule

$$w_{i,t} = \frac{w_{i,t-1}\, e^{-\lambda \nabla\ell_t(\mathbf{w}_{t-1})_i}}{\sum_{j=1}^{d} w_{j,t-1}\, e^{-\lambda \nabla\ell_t(\mathbf{w}_{t-1})_j}}.$$

This normalization corresponds to a Bregman projection of the original weight onto the
probability simplex in $\mathbb{R}^d$, where the projection is taken according to the Legendre dual

$$\Phi^*(\mathbf{u}) = \sum_{i=1}^{d} u_i (\ln u_i - 1)$$

(see Exercise 11.4). A thorough analysis of gradient-based forecasters with projected
weights is carried out in Section 11.5.
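The normalized update rule can be sketched in a few lines. This is an illustrative implementation, not taken from the text: the function name `eg_update`, the example gradient vector, and the learning rate are assumptions, and the shift by the largest exponent is a standard numerical safeguard that cancels in the normalization.

```python
import math

def eg_update(w, grad, lam):
    """One normalized exponentiated-gradient step on the probability simplex."""
    exponents = [-lam * g for g in grad]
    shift = max(exponents)      # cancels after normalization; avoids overflow
    unnormalized = [wi * math.exp(e - shift) for wi, e in zip(w, exponents)]
    total = sum(unnormalized)
    return [v / total for v in unnormalized]

w0 = [0.25, 0.25, 0.25, 0.25]   # uniform initial weights
grad = [1.0, 0.0, -1.0, 0.5]    # hypothetical loss gradient at w0
w1 = eg_update(w0, grad, lam=0.5)
```

Components with a negative gradient gain weight and components with a positive gradient lose weight, while the weights stay positive and sum to one. The ratio of two updated weights depends only on the difference of the corresponding gradient components, here `w1[2] / w1[0] = exp(0.5 * 2)`.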

The gradient-based forecaster using the Legendre exponential potential with projected 
weights is sometimes called the EG (exponentiated gradient) algorithm. Although we used 
the Legendre version of the exponential potential to be able to define EG as a gradient-based 
forecaster, it turns out that the original potential is better suited to prove a regret bound. For 
this reason, we slightly change the proof of Theorem 11.1 and use the standard exponential 
potential. Finally, because the EG forecaster uses normalized weights, we also restrict our 
expert class to those $\mathbf{u}$ that belong to the probability simplex in $\mathbb{R}^d$.


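Theorem 11.3. For any $\alpha$-subquadratic nice pair $(\sigma, \ell)$, if the EG forecaster is run
with learning rate $\lambda = 2\varepsilon/(\alpha X_\infty^2)$ on a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, where
$0 < \varepsilon < 1$, then for all $\mathbf{u} \in \mathbb{R}^d$ in the probability simplex and for all $n \ge 1$ such that
$\max_{t=1,\ldots,n} \|\mathbf{x}_t\|_\infty \le X_\infty$,

$$\widehat{L}_n^{\sigma} \le \frac{L_n^{\sigma}(\mathbf{u})}{1-\varepsilon} + \frac{\alpha\, X_\infty^2 \ln d}{2\varepsilon(1-\varepsilon)}.$$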


Proof. Recall the (non-Legendre) exponential potential, with $\eta = 1$,

$$\Phi(\mathbf{u}) = \ln \sum_{i=1}^{d} e^{u_i}.$$

We now go through the proof of Theorem 11.1 using the relative entropy

$$D(\mathbf{u} \| \mathbf{v}) = \sum_{i=1}^{d} u_i \ln \frac{u_i}{v_i}$$

(see Section A.2) in place of the divergence $D_{\Phi^*}(\mathbf{u}, \mathbf{v})$. Example 11.2 tells us that the
relative entropy is the Bregman divergence for the unnormalized negative entropy potential

$$\Phi^*(\mathbf{u}) = \sum_{i=1}^{d} u_i \ln u_i - \sum_{i=1}^{d} u_i, \qquad \mathbf{u} \in (0, \infty)^d,$$

in the special case when both $\mathbf{u}$ and $\mathbf{v}$ belong to the probability simplex in $\mathbb{R}^d$.
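Let $w'_{i,t} = w_{i,t-1}\, e^{-\lambda \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})_i}$ and let $\mathbf{w}_t$ be $\mathbf{w}'_t$ normalized. Just as in Theorem 11.1, we
begin the analysis by applying Taylor's theorem to $\ell_t^{\sigma}$:

$$\ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}) \le -(\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1}).$$

Introducing the abbreviation $\mathbf{z} = \lambda \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})$ and the new vector $\mathbf{v}$ with components
$v_i = \mathbf{w}_{t-1} \cdot \mathbf{z} - z_i$, we proceed as follows:

$$\begin{aligned}
-(\mathbf{u} - \mathbf{w}_{t-1}) \cdot \mathbf{z}
&= -\mathbf{u} \cdot \mathbf{z} + \mathbf{w}_{t-1} \cdot \mathbf{z} - \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) + \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) \\
&= -\mathbf{u} \cdot \mathbf{z} - \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{-z_i} \Bigr) + \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) \\
&= \sum_{j=1}^{d} u_j \ln e^{-z_j} - \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{-z_i} \Bigr) + \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) \\
&= \sum_{j=1}^{d} u_j \ln\Bigl( \frac{1}{w_{j,t-1}} \cdot \frac{w_{j,t-1}\, e^{-z_j}}{\sum_{i=1}^{d} w_{i,t-1} e^{-z_i}} \Bigr) + \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) \\
&= D(\mathbf{u} \| \mathbf{w}_{t-1}) - D(\mathbf{u} \| \mathbf{w}_t) + \ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr).
\end{aligned}$$

Note that, as in Theorem 11.1, we have obtained a telescoping sum. However, unlike
Theorem 11.1, the third term is not a relative entropy.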

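On the other hand, bounding this extra term is not difficult. Since $\mathbf{w}_{t-1}$ belongs to the
simplex, we may view $v_1, \ldots, v_d$ as the range of a zero-mean random variable $V$ distributed
according to $\mathbf{w}_{t-1}$. To this end, let

$$C_t = \left| \left. \frac{d\,\ell(\sigma(v), y_t)}{dv} \right|_{v = \mathbf{w}_{t-1} \cdot \mathbf{x}_t} \right| \, \|\mathbf{x}_t\|_\infty$$

so that $z_i \in [-\lambda C_t, \lambda C_t]$. Applying Hoeffding's inequality (Lemma A.1) to $V$ then yields

$$\ln\Bigl( \sum_{i=1}^{d} w_{i,t-1} e^{v_i} \Bigr) \le \frac{\lambda^2 C_t^2}{2}.$$

Summing over $t = 1, \ldots, n$ gives

$$\sum_{t=1}^{n} \bigl( \ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}) \bigr) \le \frac{D(\mathbf{u} \| \mathbf{w}_0)}{\lambda} + \frac{\lambda}{2} \sum_{t=1}^{n} C_t^2,$$

where the negative term $-D(\mathbf{u} \| \mathbf{w}_n)$ has been discarded. Using the assumption that
$(\sigma, \ell)$ is $\alpha$-subquadratic, we have $C_t^2 \le \alpha\, \ell_t^{\sigma}(\mathbf{w}_{t-1})\, X_\infty^2$. Moreover, $\boldsymbol{\theta}_0 = \mathbf{0}$ implies that,
after normalization, $\mathbf{w}_0 = (1/d, \ldots, 1/d)$. This gives $D(\mathbf{u} \| \mathbf{w}_0) \le \ln d$ (see Examples 11.2
and 11.4). Substituting these values in the above inequality we get

$$R_n^{\sigma}(\mathbf{u}) \le \frac{\ln d}{\lambda} + \frac{\alpha\lambda}{2}\, X_\infty^2\, \widehat{L}_n^{\sigma}.$$

Note that this inequality has the same form as the corresponding inequality at the end of
the proof of Theorem 11.2. Substituting our choice of $\lambda$ and rearranging yields the desired
result. □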


Note that, as for the analysis in Chapter 2, we can obtain a bound equivalent to that of
Theorem 11.3 by using Theorem 11.2 with the polynomial potential tuned to $p = 2 \ln d$.

The exponential potential has the additional limitation of using only positive weights.
As explained in Grove, Littlestone, and Schuurmans [133], this limitation can actually be
overcome by feeding to the forecaster modified side-information vectors $\mathbf{x}'_t \in \mathbb{R}^{2d}$, where

$$\mathbf{x}'_t = (-x_{1,t},\, x_{1,t},\, -x_{2,t},\, x_{2,t},\, \ldots,\, -x_{d,t},\, x_{d,t})$$

and $\mathbf{x}_t = (x_{1,t}, \ldots, x_{d,t})$ is the original unmodified vector. Note that running the gradient-based
linear forecaster with exponential potential on these modified vectors amounts to
running the same forecaster with the hyperbolic cosine potential (see Example 11.5) on the
original sequence of side information vectors.

We close this section by comparing the results obtained so far for the gradient-based
forecaster. Note that the bounds of Theorem 11.2 (for the polynomial potential) and
Theorem 11.3 (for the exponential potential) are different just because they use different pairs
of dual norms. To allow a fair comparison between these bounds, fix some linear expert $\mathbf{u}$
and consider a slight extension of Theorem 11.3 in which the simplex onto which each $\mathbf{w}'_t$
is projected is scaled enough to contain the chosen $\mathbf{u}$. Then the bound for the exponential
potential takes the form

$$\frac{L_n^{\sigma}(\mathbf{u})}{1-\varepsilon} + \frac{\alpha}{2\varepsilon(1-\varepsilon)} \times (\ln d)\, \|\mathbf{u}\|_1^2\, X_\infty^2,$$

where $X_\infty = \max_t \|\mathbf{x}_t\|_\infty$. The bound of Theorem 11.2 for the polynomial potential is very
similar:

$$\frac{L_n^{\sigma}(\mathbf{u})}{1-\varepsilon} + \frac{\alpha}{2\varepsilon(1-\varepsilon)} \times \frac{p-1}{2}\, \|\mathbf{u}\|_q^2\, X_p^2,$$

where $X_p = \max_t \|\mathbf{x}_t\|_p$, and $(p, q)$ are conjugate exponents. Note that the two bounds differ
only because the sizes of $\mathbf{u}$ and $\mathbf{x}_t$ are measured using different pairs of dual norms (1 is the
conjugate exponent of $\infty$). For $p \approx 2 \ln d$ the bound for the polynomial potential becomes
essentially equivalent to the one for the exponential potential. To analyze the other extreme,
$p = 2$ (the spherical potential), note that, using $\|\mathbf{v}\|_\infty \le \|\mathbf{v}\|_2 \le \|\mathbf{v}\|_1$ for all $\mathbf{v} \in \mathbb{R}^d$, it
is easy to construct sequences such that one of the two potentials gives a regret bound
substantially smaller than the other. For instance, consider a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots$
where, for all $t$, $\mathbf{x}_t \in \{-1, 1\}^d$ and $y_t = \mathbf{u} \cdot \mathbf{x}_t$ for $\mathbf{u} = (1, 0, \ldots, 0)$. Then $\|\mathbf{u}\|_2^2 X_2^2 = d$
and $\|\mathbf{u}\|_1^2 X_\infty^2 = 1$. Hence, the exponential potential has a considerable advantage when a
sequence of “dense” side information $\mathbf{x}_t$ can be well predicted by some “sparse” expert $\mathbf{u}$. In
the symmetric situation (sparse side information and dense experts) the spherical potential
is better. As shown by the arguments of Kivinen and Warmuth [181], these discrepancies
turn out to be real properties of the algorithms, and not mere artifacts of the proofs.
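The dense/sparse comparison reduces to a small computation with norms. The sketch below is illustrative only: the dimension `d = 16` and the particular vectors are assumptions. It evaluates the norm products appearing in the two bounds for the example in the text and for the symmetric situation.

```python
def norm_p(v, p):
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def norm_inf(v):
    return max(abs(c) for c in v)

def norm_products(u, x):
    """Return (||u||_2^2 ||x||_2^2, ||u||_1^2 ||x||_inf^2)."""
    spherical = norm_p(u, 2) ** 2 * norm_p(x, 2) ** 2
    exponential = norm_p(u, 1) ** 2 * norm_inf(x) ** 2
    return spherical, exponential

d = 16
sparse = [1.0] + [0.0] * (d - 1)
dense = [1.0 if i % 2 == 0 else -1.0 for i in range(d)]   # a point of {-1, 1}^d

# Dense side information, sparse expert: the case discussed in the text.
sph_a, exp_a = norm_products(u=sparse, x=dense)
# Sparse side information, dense expert: the symmetric case.
sph_b, exp_b = norm_products(u=dense, x=sparse)
```

In the first case the products are $d$ and 1, in the second they are $d$ and $d^2$. The full bounds also carry the factors $(p-1)/2 = 1/2$ for the spherical potential and $\ln d$ for the exponential potential, which do not change the qualitative picture.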


11.5 Forecasters Using Bregman Projections 


In this section we introduce a modified gradient-based linear forecaster, which always 
chooses its weights from a given convex set. This is done by following each gradient-based 
update by a projection onto the convex set, where the projection is based on the Bregman 
divergence defined by the forecaster’s potential (Theorem 11.3 is a first example of this 
technique). Using this simple trick, we are able to extend some of the results proven in the 
previous sections. 

In essence, projection is used to guarantee that the forecaster’s weights are kept in a
region where they enjoy certain useful properties. For example, in one of the applications
of projected forecasters shown later, confining the weights in a convex region $S$ of small
diameter allows the forecaster to effectively “track” the best linear forecaster as it moves
around in the same region $S$. This is reminiscent of the “weight sharing” technique of Section 5.2, where,
in order to track the best expert, the forecaster’s weights were kept close to the uniform
distribution.
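Let $S \subseteq \mathbb{R}^d$ be a convex and closed set and define the forecaster based on the update

$$\mathbf{w}_t = P_{\Phi^*}(\mathbf{w}'_t, S) \qquad \text{(projected gradient-based linear update)},$$

where $\mathbf{w}'_t = \nabla\Phi\bigl(\nabla\Phi^*(\mathbf{w}_{t-1}) - \lambda \nabla\ell_t(\mathbf{w}_{t-1})\bigr)$ is the standard dual gradient update, used here
as an intermediate step, and $P_{\Phi^*}(\mathbf{v}, S)$ denotes the Bregman projection $\operatorname{argmin}_{\mathbf{u} \in S} D_{\Phi^*}(\mathbf{u}, \mathbf{v})$
of $\mathbf{v}$ onto $S$ (if $\mathbf{v} \in S$, then we set $P_{\Phi^*}(\mathbf{v}, S) = \mathbf{v}$). We now show two interesting applications
of the projected forecaster.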


Tracking Linear Experts 
All results shown so far in this chapter have a common feature: even though the bounds 
hold for wide classes of data sequences, the forecaster’s regret is always measured against 
the best fixed linear expert. In this section we show that the projected gradient-based linear 
forecaster has a good regret against any linear expert that is allowed to change its weight at 
each time step in a controlled fashion. More precisely, the regret of the projected forecaster 
is shown to scale with a measure of the overall amount of changes the expert undergoes. 
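Given a transfer function $\sigma$ and an arbitrary sequence of experts $\langle \mathbf{u}_t \rangle = \mathbf{u}_0, \mathbf{u}_1, \ldots \in \mathbb{R}^d$,
define the tracking regret by

$$R_n^{\sigma}(\langle \mathbf{u}_t \rangle) = \widehat{L}_n^{\sigma} - L_n^{\sigma}(\langle \mathbf{u}_t \rangle) = \sum_{t=1}^{n} \ell_t^{\sigma}(\mathbf{w}_{t-1}) - \sum_{t=1}^{n} \ell_t^{\sigma}(\mathbf{u}_{t-1}).$$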


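Theorem 11.4. Fix any $\alpha$-subquadratic nice pair $(\sigma, \ell)$. Let $1/p + 1/q = 1$, $U_q > 0$, and
$\varepsilon \in (0, 1)$ be parameters. If the projected gradient-based forecaster based on the Legendre
polynomial potential $\Phi_p$ is run with $S = \{\mathbf{w} \in \mathbb{R}^d : \Phi_q(\mathbf{w}) \le U_q\}$ and learning rate
$\lambda = 2\varepsilon/((p-1)\,\alpha\, X_p^2)$ on a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots$ in $\mathbb{R}^d \times \mathbb{R}$, then for all sequences
$\langle \mathbf{u}_t \rangle = \mathbf{u}_0, \mathbf{u}_1, \ldots \in S$ and for all $n \ge 1$ such that $\max_{t=1,\ldots,n} \|\mathbf{x}_t\|_p \le X_p$,

$$\widehat{L}_n^{\sigma} \le \frac{L_n^{\sigma}(\langle \mathbf{u}_t \rangle)}{1-\varepsilon} + \frac{(p-1)\,\alpha\, X_p^2}{2\varepsilon(1-\varepsilon)} \left( \sqrt{2U_q}\, \sum_{t=1}^{n} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q + \frac{\|\mathbf{u}_n\|_q^2}{2} \right).$$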


Observe that the smaller the parameter $U_q$, the smaller the second term becomes in the
upper bound. On the other hand, with a small value of $U_q$, the set $S$ shrinks and the loss of
the forecaster is compared to sequences $\langle \mathbf{u}_t \rangle$ taking values in a reduced set.

Note that when $\mathbf{u}_0 = \mathbf{u}_1 = \cdots = \mathbf{u}_n$, the regret bound reduces to the bound proven in
Theorem 11.2 for the regret against a fixed linear expert. The term

$$\sum_{t=1}^{n} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q$$

can thus be viewed as a measure of “complexity” for the sequence $\langle \mathbf{u}_t \rangle$.
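Before proving Theorem 11.4 we need a technical lemma stating that the polynomial
potential of a vector is invariant with respect to the invertible mapping $\nabla\Phi_p$.

Lemma 11.6. Let $\Phi_p$ be the Legendre polynomial potential. For all $\boldsymbol{\theta} \in \mathbb{R}^d$,
$\Phi_p(\boldsymbol{\theta}) = \Phi_q(\nabla\Phi_p(\boldsymbol{\theta}))$.


Proof. Let $\mathbf{w} = \nabla\Phi_p(\boldsymbol{\theta})$. By Lemma 11.4, $\Phi_p(\boldsymbol{\theta}) + \Phi_q(\mathbf{w}) = \boldsymbol{\theta} \cdot \mathbf{w}$. By Hölder’s inequality,
$\boldsymbol{\theta} \cdot \mathbf{w} \le 2\sqrt{\Phi_p(\boldsymbol{\theta})\, \Phi_q(\mathbf{w})}$. Hence,

$$\Bigl( \sqrt{\Phi_p(\boldsymbol{\theta})} - \sqrt{\Phi_q(\mathbf{w})} \Bigr)^2 \le 0,$$

which implies $\Phi_p(\boldsymbol{\theta}) = \Phi_q(\mathbf{w})$. □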
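Lemma 11.6 is easy to check numerically. The sketch below is an illustration under assumed values ($p = 4$ and an arbitrary vector `theta`); the helper names are not from the text. It also checks that $\nabla\Phi_q$ inverts $\nabla\Phi_p$ and that $\Phi_p(\boldsymbol{\theta}) + \Phi_q(\mathbf{w}) = \boldsymbol{\theta} \cdot \mathbf{w}$, the identity used in the proof.

```python
import math

def phi(v, r):
    """Legendre polynomial potential 0.5 * ||v||_r^2."""
    return 0.5 * sum(abs(c) ** r for c in v) ** (2.0 / r)

def grad_phi(v, r):
    """Gradient of 0.5 * ||v||_r^2."""
    norm = sum(abs(c) ** r for c in v) ** (1.0 / r)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(c) ** (r - 1), c) / norm ** (r - 2) for c in v]

p = 4.0
q = p / (p - 1)                       # conjugate exponent
theta = [0.7, -1.3, 2.1, 0.0, -0.4]   # arbitrary primal vector
w = grad_phi(theta, p)                # dual vector
theta_back = grad_phi(w, q)           # should recover theta
inner = sum(t * wi for t, wi in zip(theta, w))
```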


We are now ready to prove the theorem bounding the tracking regret of the projected 
gradient-based forecaster. 


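Proof of Theorem 11.4. From the proof of Theorem 11.1 we get

$$\begin{aligned}
\ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}_{t-1})
&\le \frac{1}{\lambda} \Bigl( D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}'_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr) \\
&\le \frac{1}{\lambda} \Bigl( D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr),
\end{aligned}$$

where in the second step we used the fact that, since $\mathbf{u}_{t-1} \in S$, by the generalized
pythagorean inequality (Lemma 11.3),

$$D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}'_t) \ge D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_t, \mathbf{w}'_t) \ge D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t).$$

With the purpose of obtaining a telescoping sum, we add and subtract the term
$D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t)$ in the last formula, obtaining

$$\begin{aligned}
\ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}_{t-1})
\le \frac{1}{\lambda} \Bigl( & D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t) \\
& - D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr).
\end{aligned}$$

We analyze the five terms on the right-hand side as we sum for $t = 1, \ldots, n$. The first two
terms telescope; hence, as in the proof of Theorem 11.2, we get

$$\sum_{t=1}^{n} \Bigl( D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t) \Bigr) \le \Phi_q(\mathbf{u}_0).$$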


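Using the definition of Bregman divergence, the third and fourth terms can be rewritten as

$$-D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t) = \Phi_q(\mathbf{u}_t) - \Phi_q(\mathbf{u}_{t-1}) + (\mathbf{u}_{t-1} - \mathbf{u}_t) \cdot \nabla\Phi_q(\mathbf{w}_t).$$

Note that the sum of $\Phi_q(\mathbf{u}_t) - \Phi_q(\mathbf{u}_{t-1})$ telescopes. Moreover, letting $\boldsymbol{\theta}_t = \nabla\Phi_q(\mathbf{w}_t)$,

$$\begin{aligned}
(\mathbf{u}_{t-1} - \mathbf{u}_t) \cdot \nabla\Phi_q(\mathbf{w}_t)
&= (\mathbf{u}_{t-1} - \mathbf{u}_t) \cdot \boldsymbol{\theta}_t \\
&\le 2\sqrt{\Phi_q(\mathbf{u}_{t-1} - \mathbf{u}_t)\, \Phi_p(\boldsymbol{\theta}_t)} \qquad \text{(by Hölder’s inequality)} \\
&= 2\sqrt{\Phi_q(\mathbf{u}_{t-1} - \mathbf{u}_t)\, \Phi_q(\mathbf{w}_t)} \qquad \text{(by Lemma 11.6)} \\
&\le \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q \sqrt{2U_q} \qquad \text{(since $\mathbf{w}_t \in S$)}.
\end{aligned}$$

Hence,

$$\sum_{t=1}^{n} \Bigl( -D_{\Phi_q}(\mathbf{u}_{t-1}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{u}_t, \mathbf{w}_t) \Bigr) \le \Phi_q(\mathbf{u}_n) - \Phi_q(\mathbf{u}_0) + \sqrt{2U_q}\, \sum_{t=1}^{n} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q.$$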


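Finally, we bound the fifth term using the techniques described in the proof of Theorem 11.2:

$$\frac{1}{\lambda} \sum_{t=1}^{n} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \le (p-1) \frac{\alpha\lambda}{2}\, X_p^2\, \widehat{L}_n^{\sigma}.$$

Hence, piecing everything together,

$$\begin{aligned}
\sum_{t=1}^{n} \bigl( \ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}_{t-1}) \bigr)
&\le \frac{\Phi_q(\mathbf{u}_0)}{\lambda} + \frac{\Phi_q(\mathbf{u}_n) - \Phi_q(\mathbf{u}_0)}{\lambda}
   + \frac{\sqrt{2U_q}}{\lambda} \sum_{t=1}^{n} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q + (p-1)\frac{\alpha\lambda}{2}\, X_p^2\, \widehat{L}_n^{\sigma} \\
&= \frac{\Phi_q(\mathbf{u}_n)}{\lambda} + \frac{\sqrt{2U_q}}{\lambda} \sum_{t=1}^{n} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_q + (p-1)\frac{\alpha\lambda}{2}\, X_p^2\, \widehat{L}_n^{\sigma}.
\end{aligned}$$

Rearranging, substituting our choice of $\lambda$, and using the equality $\Phi_q(\mathbf{u}_n) = \frac{1}{2}\|\mathbf{u}_n\|_q^2$ yields
the desired result. □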


Self-Confident Linear Forecasters 

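The results of Section 11.4 leave open the problem of finding a forecaster with a regret
growing sublinearly in $L_n^{\sigma}(\mathbf{u})$. If one limited the possible values of $\mathbf{u}$ to a bounded subset,
then by choosing the learning rate $\lambda$ as a function of $n$ (assuming that the total number of
rounds $n$ is known in advance), one could optimize the bound for the maximal regret and
obtain a bound that grows at a rate of $\sqrt{n}$. However, ideally, the regret bound should scale
as $\sqrt{L_n^{\sigma}(\mathbf{u})}$ for each $\mathbf{u}$. Next, we show that, using a time-varying learning rate, the regret of
the projected forecaster is bounded by a quantity of the order of $\sqrt{L_n^{\sigma}(\mathbf{u})}$ uniformly over
time and for all $\mathbf{u}$ in a region of bounded potential.

SELF-CONFIDENT GRADIENT-BASED FORECASTER


Parameters: $\alpha$-subquadratic nice pair $(\sigma, \ell)$, reals $p \ge 2$, $U_q > 0$, convex set
$S = \{\mathbf{u} \in \mathbb{R}^d : \Phi_q(\mathbf{u}) \le U_q\}$, where $1/p + 1/q = 1$.

Initialization: $\mathbf{w}_0 = \nabla\Phi_p(\mathbf{0})$.

For each round $t = 1, 2, \ldots$


(1) observe $\mathbf{x}_t$ and predict $\widehat{p}_t = \sigma(\mathbf{w}_{t-1} \cdot \mathbf{x}_t)$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t^{\sigma}(\mathbf{w}_{t-1}) = \ell(\widehat{p}_t, y_t)$;
(3) let $\mathbf{w}'_t = \nabla\Phi_p\bigl(\nabla\Phi_q(\mathbf{w}_{t-1}) - \lambda_t \nabla\ell_t^{\sigma}(\mathbf{w}_{t-1})\bigr)$, where

$$\lambda_t = \frac{\beta_t}{(p-1)\,\alpha\, X_{p,t}^2}, \qquad X_{p,t} = \max_{s=1,\ldots,t} \|\mathbf{x}_s\|_p, \qquad \beta_t = \sqrt{\frac{k_t}{k_t + \widehat{L}_t^{\sigma}}},$$

$$k_t = (p-1)\,\alpha\, X_{p,t}^2\, U_q, \qquad \widehat{L}_t^{\sigma} = \sum_{s=1}^{t} \ell_s^{\sigma}(\mathbf{w}_{s-1});$$

(4) let $\mathbf{w}_t = P_{\Phi_q}(\mathbf{w}'_t, S)$.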
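The four steps of the self-confident forecaster can be sketched for the special case $p = q = 2$. The assumptions behind this sketch, none of which come from the text, are: the tanh transfer function with the entropic loss of Example 11.6 (so $\alpha = 2$); synthetic data with nonzero side information; and the observation that for $p = 2$ both gradient maps are the identity and the Bregman projection onto $S$ is the Euclidean projection onto a ball of radius $\sqrt{2U_q}$.

```python
import math
import random

ALPHA = 2.0   # subquadratic constant of the tanh / entropic pair

def entropic_loss(p_hat, y):
    return ((1 + y) / 2) * math.log((1 + y) / (1 + p_hat)) \
        + ((1 - y) / 2) * math.log((1 - y) / (1 - p_hat))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def self_confident_run(data, u_q):
    """Self-confident forecaster for p = q = 2; returns (cumulative loss, X_p)."""
    radius = math.sqrt(2 * u_q)        # S = {w : 0.5 * ||w||^2 <= u_q}
    w = [0.0] * len(data[0][0])
    x_max, total = 0.0, 0.0
    for x, y in data:
        p_hat = math.tanh(dot(w, x))                 # step (1)
        total += entropic_loss(p_hat, y)             # step (2): total is hat L_t
        x_max = max(x_max, math.sqrt(dot(x, x)))     # X_{p,t}; assumes x_1 != 0
        k_t = ALPHA * x_max ** 2 * u_q               # (p - 1) = 1 for p = 2
        beta_t = math.sqrt(k_t / (k_t + total))
        lam_t = beta_t / (ALPHA * x_max ** 2)
        # step (3): gradient step; d loss / dv = p_hat - y for this pair
        w = [wi - lam_t * (p_hat - y) * xi for wi, xi in zip(w, x)]
        # step (4): Euclidean projection onto the ball of radius sqrt(2 U_q)
        norm = math.sqrt(dot(w, w))
        if norm > radius:
            w = [c * radius / norm for c in w]
    return total, x_max

random.seed(5)
u, u_q = [1.0, 0.0, -0.5, 0.0, 0.25], 1.0     # Phi_q(u) = 0.65625 <= U_q
data = []
for _ in range(500):
    x = [random.choice((-1.0, 1.0)) for _ in range(5)]
    data.append((x, 0.9 * math.tanh(dot(u, x))))

forecaster_loss, x_p = self_confident_run(data, u_q)
expert_loss = sum(entropic_loss(math.tanh(dot(u, x)), y) for x, y in data)
k_n = ALPHA * x_p ** 2 * u_q
regret_bound = 5 * math.sqrt(k_n * expert_loss) + 30 * k_n
```

The quantity `regret_bound` is the guarantee stated for this forecaster in Theorem 11.5, evaluated for the chosen `u`. No learning rate or horizon needs to be supplied by the user.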


A similar result is achieved in Section 2.3 of Chapter 2 by tuning the exponentially weighted
average forecaster at time $t$ with the loss of the best expert up to time $t$. Here, however,
there are infinitely many experts, and the problem of tracking the cumulative loss of the best
expert could easily become impractical. Hence, we use the trick of tuning the forecaster at
time $t + 1$ using its own loss $\widehat{L}_t^{\sigma}$. Since replacing the best expert’s loss with the forecaster’s
loss is justified only if the forecaster is doing almost as well as the currently best expert
in predicting the sequence, we call this the “self-confident” gradient-based forecaster. The
next result shows that the forecaster’s self-confidence is indeed justified.


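Theorem 11.5. Fix any $\alpha$-subquadratic nice pair $(\sigma, \ell)$. If the self-confident gradient-based
forecaster is run on a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then for all $\mathbf{u} \in \mathbb{R}^d$ such
that $\Phi_q(\mathbf{u}) \le U_q$ and for all $n \ge 1$,

$$R_n^{\sigma}(\mathbf{u}) \le 5\sqrt{(p-1)\,\alpha\, X_p^2\, U_q\, L_n^{\sigma}(\mathbf{u})} + 30\,(p-1)\,\alpha\, X_p^2\, U_q,$$

where $X_p = \max_{t=1,\ldots,n} \|\mathbf{x}_t\|_p$.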


Note that the self-confident forecaster assumes a bound $\sqrt{2U_q}$ on the norm $\|\mathbf{u}\|_q$ of the
linear experts $\mathbf{u}$ against which the regret is measured. The gradient-based forecaster of
Section 11.4, instead, assumes a bound $X_p$ on the largest norm $\|\mathbf{x}_t\|_p$ of the side-information
sequence $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Since $\mathbf{u} \cdot \mathbf{x}_t \le \|\mathbf{u}\|_q \|\mathbf{x}_t\|_p$, the two assumptions impose similar
constraints on the experts.

Before proceeding to the proof of Theorem 11.5, we give two lemmas. The first states that
the divergence of vectors with bounded polynomial potential is bounded. This simple fact is
crucial in the proof of the theorem. Note that the same lemma is not true for the exponential
potential, and this is the main reason why we analyze the self-confident predictor for the
polynomial potential only.

Lemma 11.7. For all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ such that $\Phi_q(\mathbf{u}) \le U_q$ and $\Phi_q(\mathbf{v}) \le U_q$, $D_{\Phi_q}(\mathbf{u}, \mathbf{v}) \le 4U_q$.

Proof. First of all, note that the Bregman divergence based on the polynomial potential
can be rewritten as

$$D_{\Phi_q}(\mathbf{u}, \mathbf{v}) = \Phi_q(\mathbf{u}) + \Phi_q(\mathbf{v}) - \mathbf{u} \cdot \nabla\Phi_q(\mathbf{v}).$$

This implies the following:

$$\begin{aligned}
D_{\Phi_q}(\mathbf{u}, \mathbf{v})
&\le \Phi_q(\mathbf{u}) + \Phi_q(\mathbf{v}) + |\mathbf{u} \cdot \nabla\Phi_q(\mathbf{v})| \\
&\le \Phi_q(\mathbf{u}) + \Phi_q(\mathbf{v}) + 2\sqrt{\Phi_q(\mathbf{u})\, \Phi_p(\nabla\Phi_q(\mathbf{v}))} \qquad \text{(by Hölder’s inequality)} \\
&= \Phi_q(\mathbf{u}) + \Phi_q(\mathbf{v}) + 2\sqrt{\Phi_q(\mathbf{u})\, \Phi_q(\mathbf{v})} \qquad \text{(by Lemma 11.6)} \\
&\le 4U_q
\end{aligned}$$

and the proof is concluded. □

The proof of the next lemma is left as an exercise.


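Lemma 11.8. Let $a, \ell_1, \ldots, \ell_n$ be nonnegative real numbers. Then

$$\sum_{t=1}^{n} \frac{\ell_t}{\sqrt{a + \sum_{s=1}^{t} \ell_s}} \le 2\left( \sqrt{a + \sum_{t=1}^{n} \ell_t} - \sqrt{a} \right).$$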
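Although the proof of Lemma 11.8 is left as an exercise, the inequality is easy to test numerically. The following sketch is illustrative only: the random sequences, the values of $a$, and the convention of skipping terms of the form 0/0 are assumptions.

```python
import math
import random

def lemma_sides(a, losses):
    """Return (left-hand side, right-hand side) of the inequality in Lemma 11.8."""
    lhs, running = 0.0, 0.0
    for loss in losses:
        running += loss
        if a + running > 0:        # skip 0/0 terms (a = 0 and an all-zero prefix)
            lhs += loss / math.sqrt(a + running)
    rhs = 2 * (math.sqrt(a + running) - math.sqrt(a))
    return lhs, rhs

random.seed(1)
cases = [(a, [random.uniform(0, 3) for _ in range(50)]) for a in (0.0, 0.1, 1.0, 25.0)]
results = [lemma_sides(a, losses) for a, losses in cases]
```

One way to see why the inequality holds, which is the fact this check exercises term by term, is that $\sqrt{b} - \sqrt{c} \ge (b - c)/(2\sqrt{b})$ for $b \ge c \ge 0$, $b > 0$, so each summand on the left is at most twice the increment of the square root on the right.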


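Proof of Theorem 11.5. Proceeding as in the proof of Theorem 11.4, we get

$$\ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}) \le \frac{1}{\lambda_t} \Bigl( D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr).$$

Now, by our choice of $\lambda_t$,

$$\begin{aligned}
& \frac{1}{\lambda_t} \Bigl( D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr) \\
&\quad = (p-1)\,\alpha \left( \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) \right) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \\
&\quad = (p-1)\,\alpha \left( \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - \frac{X_{p,t+1}^2}{\beta_{t+1}} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t)
   + D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) \left( \frac{X_{p,t+1}^2}{\beta_{t+1}} - \frac{X_{p,t}^2}{\beta_t} \right) \right) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \\
&\quad \le (p-1)\,\alpha \left( \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - \frac{X_{p,t+1}^2}{\beta_{t+1}} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t)
   + 4U_q \left( \frac{X_{p,t+1}^2}{\beta_{t+1}} - \frac{X_{p,t}^2}{\beta_t} \right) \right) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t),
\end{aligned}$$

where we used Lemma 11.7 in the last step. To bound the last term, we again apply
techniques from the proof of Theorem 11.2:

$$\frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \le \frac{p-1}{2}\, \lambda_t\, \alpha\, X_{p,t}^2\, \ell_t^{\sigma}(\mathbf{w}_{t-1}) = \frac{\beta_t}{2}\, \ell_t^{\sigma}(\mathbf{w}_{t-1}).$$

We now sum over $t = 1, \ldots, n$. Note that the quantity $X_{p,n+1}^2/\beta_{n+1}$ is a free parameter
here, and we conveniently set $\beta_{n+1} = \beta_n$ and $X_{p,n+1} = X_{p,n}$. This yields

$$\begin{aligned}
\sum_{t=1}^{n} \bigl( \ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}) \bigr)
&\le (p-1)\,\alpha \left( \frac{X_{p,1}^2}{\beta_1} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0) + 4U_q \left( \frac{X_{p,n+1}^2}{\beta_{n+1}} - \frac{X_{p,1}^2}{\beta_1} \right) \right) + \frac{1}{2} \sum_{t=1}^{n} \beta_t\, \ell_t^{\sigma}(\mathbf{w}_{t-1}) \\
&\le 4\,(p-1)\,\alpha\, U_q\, \frac{X_{p,n}^2}{\beta_n} + \frac{1}{2} \sum_{t=1}^{n} \beta_t\, \ell_t^{\sigma}(\mathbf{w}_{t-1}),
\end{aligned}$$

where we again applied Lemma 11.7 to the term $D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0)$. Recalling that
$k_t = (p-1)\,\alpha\, X_{p,t}^2\, U_q$, we can rewrite the above as

$$\begin{aligned}
\sum_{t=1}^{n} \bigl( \ell_t^{\sigma}(\mathbf{w}_{t-1}) - \ell_t^{\sigma}(\mathbf{u}) \bigr)
&\le 4\sqrt{k_n (k_n + \widehat{L}_n^{\sigma})} + \frac{1}{2} \sum_{t=1}^{n} \beta_t\, \ell_t^{\sigma}(\mathbf{w}_{t-1}) \\
&\le 4\sqrt{k_n (k_n + \widehat{L}_n^{\sigma})} + \frac{\sqrt{k_n}}{2} \sum_{t=1}^{n} \frac{\ell_t^{\sigma}(\mathbf{w}_{t-1})}{\sqrt{k_n + \widehat{L}_t^{\sigma}}},
\end{aligned}$$

where we used

$$\beta_t = \sqrt{\frac{k_t}{k_t + \widehat{L}_t^{\sigma}}} \le \sqrt{\frac{k_n}{k_n + \widehat{L}_t^{\sigma}}}.$$

Applying Lemma 11.8, we then immediately get

$$\widehat{L}_n^{\sigma} - L_n^{\sigma}(\mathbf{u}) \le 4\sqrt{k_n (k_n + \widehat{L}_n^{\sigma})} + \sqrt{k_n (k_n + \widehat{L}_n^{\sigma})} = 5\sqrt{k_n (k_n + \widehat{L}_n^{\sigma})}.$$

Solving for $\widehat{L}_n^{\sigma}$ and overapproximating gives the desired result. □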


Projected gradient-based forecasters can be used with the exponential potential. However,
the convex set $S$ onto which weights are projected should not be taken as the probability
simplex in $\mathbb{R}^d$, which is the most natural choice for this potential (see, e.g., Theorem 11.3).
The region of the simplex where one or more of the weight components are close to 0 would
blow up either the dual of the exponential potential (preventing the tracking analysis of
Theorem 11.4) or the Bregman divergence (preventing the self-confident analysis of
Theorem 11.5). A simple trick to fix this problem is to intersect the simplex with the hypercube
$[\beta/d, 1]^d$, where $0 < \beta < 1$ is a free parameter. This amounts to imposing a lower bound
on each weight component (see Exercise 11.10).


11.6 Time-Varying Potentials


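Consider again the gradient-based forecasters introduced in Section 11.3 (without the use
of any transfer function). To motivate such forecasters, we observed that the dual weight
$\mathbf{w}_t$ is a solution to the convex minimization problem

$$\min_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda \bigl( \ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\ell_t(\mathbf{w}_{t-1}) \bigr) \Bigr],$$

where the term $\ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\ell_t(\mathbf{w}_{t-1})$ is the first-order Taylor approximation
of $\ell_t(\mathbf{u})$ around $\mathbf{w}_{t-1}$. As mentioned in Exercise 11.3, an alternative (and more natural)
definition of $\mathbf{w}_t$ would look like

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda\, \ell_t(\mathbf{u}) \Bigr].$$

An interesting closed-form solution for $\mathbf{w}_t$ in this expression is obtained for a potential
that evolves with time, where the evolution depends on the loss. In particular, let $\Phi$ be an
arbitrary Legendre potential and $\Phi^*$ its dual potential. Define the recurrence

$$\Phi_0^* = \Phi^*, \qquad \Phi_t^* = \Phi_{t-1}^* + \ell_t \qquad \text{(time-varying potential)}.$$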


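Here, as usual, $\ell_t(\mathbf{w}) = \ell(\mathbf{w} \cdot \mathbf{x}_t, y_t)$ is the convex function induced by the loss at time $t$ and,
conventionally, we let $\ell_0$ be the zero function. If the potentials in the sequence $\Phi_0^*, \Phi_1^*, \ldots$
are all Legendre, then for all $t \ge 1$, the associated forecaster is defined by

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \ell_t(\mathbf{u}) \Bigr],$$

where $\mathbf{w}_0 = \mathbf{0}$. (Note that, for simplicity, here we take $\lambda = 1$ because the learning parameter
is not exploited below.) This can also be rewritten as

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \Phi_t^*(\mathbf{u}) - \Phi_{t-1}^*(\mathbf{u}) \Bigr].$$

By setting to 0 the gradient of the expression in brackets, one finds that the solution
$\mathbf{w}_t$ is defined by $\nabla\Phi_t^*(\mathbf{w}_t) = \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})$. Solving for $\mathbf{w}_t$ one then gets
$\mathbf{w}_t = \nabla\Phi_t\bigl(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})\bigr)$. Note that, due to $\nabla\Phi_t^*(\mathbf{w}_t) = \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})$, the above solution can
also be written as $\mathbf{w}_t = \nabla\Phi_t(\boldsymbol{\theta}_0)$, where $\boldsymbol{\theta}_0 = \nabla\Phi_0^*(\mathbf{0})$ is a base primal weight. This shows
that, in contrast to the fixed potential case, no explicit update is carried out, and the potential
evolution is entirely responsible for the weight dynamics. The following two diagrams
illustrate the evolution of primal and dual weights in the case of fixed (left-hand side)
and time-varying (right-hand side) potential.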


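[Diagrams. Left (fixed potential): the primal weights $\boldsymbol{\theta}_0, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_t$ evolve by gradient updates and each
$\boldsymbol{\theta}_s$ is mapped to the dual weight $\mathbf{w}_s$ through $\nabla\Phi$. Right (time-varying potential): the single primal
weight $\boldsymbol{\theta}_0$ is mapped to $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_t$ through $\nabla\Phi_0, \nabla\Phi_1, \ldots, \nabla\Phi_t$.]

Remark 11.1. If $\nabla\Phi^*(\mathbf{0}) = \mathbf{0}$, then $\nabla\Phi_t^*(\mathbf{w}_t) = \mathbf{0}$ for all $t \ge 0$ and one can equivalently
define $\mathbf{w}_t$ by

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Phi_t^*(\mathbf{u}),$$

where, we recall, $\Phi_t^*$ is convex for all $t$ (see also Exercise 11.13 for additional remarks on
this alternative formulation).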


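Remark 11.2. Note that expanding the Bregman divergence term in the above definition of
gradient-based linear forecaster, and performing an obvious simplification, yields

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ \Phi_t^*(\mathbf{u}) - \Phi_{t-1}^*(\mathbf{w}_{t-1}) - (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}) \Bigr].$$

The term in brackets is the difference between $\Phi_t^*(\mathbf{u})$ and the linear approximation of $\Phi_{t-1}^*$
around $\mathbf{w}_{t-1}$, which again looks like a divergence.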


The gradient-based forecaster using the time-varying potential defined above is sketched 
in the following. 


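GRADIENT-BASED FORECASTER WITH TIME-VARYING POTENTIAL
Initialization: $\mathbf{w}_0 = \mathbf{0}$ and $\Phi_0^* = \Phi^*$.
For each round $t = 1, 2, \ldots$


(1) observe $\mathbf{x}_t$ and predict $\widehat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t(\mathbf{w}_{t-1}) = \ell(\widehat{p}_t, y_t)$;
(3) let $\mathbf{w}_t = \nabla\Phi_t\bigl(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})\bigr)$.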


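Theorem 11.6. Fix a regular loss function $\ell$. If the gradient-based linear forecaster is run
with a time-varying Legendre potential $\Phi$ such that $\nabla\Phi^*(\mathbf{0}) = \mathbf{0}$, then, for all $\mathbf{u} \in \mathbb{R}^d$,

$$R_n(\mathbf{u}) = D_{\Phi_0^*}(\mathbf{u}, \mathbf{w}_0) - D_{\Phi_n^*}(\mathbf{u}, \mathbf{w}_n) + \sum_{t=1}^{n} D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t).$$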


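Proof. Choose any $\mathbf{u} \in \mathbb{R}^d$. Using $\nabla\Phi_t^*(\mathbf{w}_t) = \mathbf{0}$ for all $t \ge 0$ (see Remark 11.1), one
immediately gets $D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) = \Phi_t^*(\mathbf{u}) - \Phi_t^*(\mathbf{w}_t)$ for all $\mathbf{u} \in \mathbb{R}^d$. Since
$\Phi_t^*(\mathbf{u}) = \Phi_{t-1}^*(\mathbf{u}) + \ell_t(\mathbf{u})$, we get $\ell_t(\mathbf{u}) = D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + \Phi_t^*(\mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{u})$ and
$\ell_t(\mathbf{w}_{t-1}) = D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) + \Phi_t^*(\mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{w}_{t-1})$. This yields

$$\begin{aligned}
\ell_t(\mathbf{w}_{t-1}) - \ell_t(\mathbf{u})
&= D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{w}_{t-1}) - D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + \Phi_{t-1}^*(\mathbf{u}) \\
&= D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) - D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}).
\end{aligned}$$

Summing over $t$ gives the desired result. □

Note that Theorem 11.6 provides an exact characterization (there are no inequalities!) of
the regret in terms of Bregman divergences. In Section 11.7 we show the elliptic potential,
a concrete example of time-varying potential. Using Theorem 11.6 we derive a bound on
the cumulative regret of the gradient-based forecaster using the elliptic potential.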
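The exactness of Theorem 11.6 can be observed numerically in a one-dimensional instance. This sketch rests on assumptions that are not part of the statement above: it takes $\Phi(u) = u^2/2$ and the square loss $\ell_t(u) = \frac{1}{2}(u x_t - y_t)^2$, so that $\Phi_t^*(u) = \frac{1}{2} a_t u^2 - b_t u + \text{const}$ with $a_t = 1 + \sum_{s \le t} x_s^2$ and $b_t = \sum_{s \le t} y_s x_s$, the minimizer is $w_t = b_t/a_t$, and $D_{\Phi_t^*}(u, w) = \frac{1}{2} a_t (u - w)^2$. The data and the comparator `u` are synthetic.

```python
import random

random.seed(4)
u = 0.8                        # arbitrary comparator
a, b, w = 1.0, 0.0, 0.0        # a_0 = 1, b_0 = 0, w_0 = 0
regret = 0.0
divergence_sum = 0.0
for _ in range(100):
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    # loss of the forecaster minus loss of the comparator at this round
    regret += 0.5 * (w * x - y) ** 2 - 0.5 * (u * x - y) ** 2
    a += x * x                 # curvature a_t of the new potential
    b += y * x
    w_next = b / a             # minimizer of the new potential (Remark 11.1)
    divergence_sum += 0.5 * a * (w - w_next) ** 2   # D(w_{t-1}, w_t) under Phi*_t
    w = w_next

# D_{Phi*_0}(u, w_0) - D_{Phi*_n}(u, w_n) + sum of the per-round divergences
identity_rhs = 0.5 * u ** 2 - 0.5 * a * (u - w) ** 2 + divergence_sum
```

Up to floating-point rounding, `regret` and `identity_rhs` coincide, with no slack between the two sides.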


11.7 The Elliptic Potential 


We now show an application of the time-varying potential based on the polynomial potential 
(with p = 2) and the square loss. To this end, we introduce some additional notation. 

A vector $\mathbf{u}$ is always understood as a column vector. Let $(\cdot)^{\top}$ be the transpose operator.
Then $\mathbf{u}^{\top}$ is a row vector, $A^{\top}$ is the transpose of matrix $A$, and $\mathbf{u}^{\top}\mathbf{v}$ is the inner product
between vectors $\mathbf{u}$ and $\mathbf{v}$ (which we also denote with $\mathbf{u} \cdot \mathbf{v}$). Likewise, we define the outer
product $\mathbf{u}\mathbf{v}^{\top}$ yielding the square matrix whose element in row $i$ and column $j$ is $u_i v_j$.

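Let the $d \times d$ matrix $M$ be symmetric, positive definite and of rank $d$. Let $\mathbf{v} \in \mathbb{R}^d$ and
$c \in \mathbb{R}$ be arbitrary. The triple $(M, \mathbf{v}, c)$ defines the potential

$$\Phi(\mathbf{u}) = \frac{1}{2}\mathbf{u}^{\top} M \mathbf{u} + \mathbf{u}^{\top}\mathbf{v} + c \qquad \text{(the elliptic potential)}.$$

Note that because $M$ is positive definite, $M^{1/2}$ exists, and therefore we can also write
$\Phi(\mathbf{u}) = \frac{1}{2}\|M^{1/2}\mathbf{u}\|^2 + \mathbf{u}^{\top}\mathbf{v} + c$. The elliptic potential is easily seen to be Legendre with
$\nabla\Phi(\mathbf{u}) = M\mathbf{u} + \mathbf{v}$. Since $M$ is full rank, $M^{-1}$ exists and $\nabla\Phi^*(\mathbf{u}) = M^{-1}(\mathbf{u} - \mathbf{v})$. Hence,

$$\Phi^*(\mathbf{u}) = \frac{1}{2}\bigl\|M^{-1/2}(\mathbf{u} - \mathbf{v})\bigr\|^2 - c = \frac{1}{2}\bigl\|M^{-1/2}\mathbf{u}\bigr\|^2 - \mathbf{u}^{\top} M^{-1}\mathbf{v} + \frac{1}{2}\bigl\|M^{-1/2}\mathbf{v}\bigr\|^2 - c,$$

which shows that the dual potential of an elliptic potential is also elliptic.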
Elliptic potentials enjoy the following property. 


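Lemma 11.9. Let $\Phi$ be an elliptic potential defined by the triple $(M, \mathbf{v}, c)$. Then, for all
$\mathbf{u}, \mathbf{w} \in \mathbb{R}^d$,

$$D_{\Phi}(\mathbf{u}, \mathbf{w}) = \frac{1}{2}\bigl\|M^{1/2}(\mathbf{u} - \mathbf{w})\bigr\|^2.$$

The proof is left as an exercise.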

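We now show that the time-varying potential obtained from the polynomial potential
$\Phi(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|^2$ and the square loss $\ell(\widehat{p}, y) = \frac{1}{2}(\widehat{p} - y)^2$ is an elliptic potential. First note
that $\Phi = \Phi^*$, due to the self-duality of the 2-norm. Rewrite $\Phi^*(\mathbf{u})$ as $\frac{1}{2}\mathbf{u}^{\top} I \mathbf{u}$, where $I$ is
the $d \times d$ identity matrix, and let $\Phi_0^* = \Phi^*$. Then, following the definition of time-varying
potential,

$$\begin{aligned}
\Phi_1^*(\mathbf{u}) &= \Phi_0^*(\mathbf{u}) + \ell_1(\mathbf{u}) \\
&= \frac{1}{2}\mathbf{u}^{\top} I \mathbf{u} + \frac{1}{2}\bigl(\mathbf{u}^{\top}\mathbf{x}_1 - y_1\bigr)^2 \\
&= \frac{1}{2}\mathbf{u}^{\top} I \mathbf{u} + \frac{1}{2}\mathbf{u}^{\top}\mathbf{x}_1\mathbf{x}_1^{\top}\mathbf{u} - y_1 \mathbf{u}^{\top}\mathbf{x}_1 + \frac{y_1^2}{2} \\
&= \frac{1}{2}\mathbf{u}^{\top}\bigl(I + \mathbf{x}_1\mathbf{x}_1^{\top}\bigr)\mathbf{u} - y_1 \mathbf{u}^{\top}\mathbf{x}_1 + \frac{y_1^2}{2}.
\end{aligned}$$

Iterating this argument, we obtain the time-varying elliptic potential

$$\Phi_t^*(\mathbf{u}) = \frac{1}{2}\mathbf{u}^{\top}\Bigl( I + \sum_{s=1}^{t} \mathbf{x}_s\mathbf{x}_s^{\top} \Bigr)\mathbf{u} - \mathbf{u}^{\top}\sum_{s=1}^{t} y_s \mathbf{x}_s + \frac{1}{2}\sum_{s=1}^{t} y_s^2.$$

We now take a look at the update rule $\mathbf{w}_t = \nabla\Phi_t\bigl(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})\bigr)$ when $\Phi_t$ is the
time-varying elliptic potential. Introduce

$$A_t = I + \sum_{s=1}^{t} \mathbf{x}_s\mathbf{x}_s^{\top} \qquad \text{for all } t = 0, 1, 2, \ldots$$

Using the definition of $\Phi_t^*$, we can easily compute the gradient

$$\nabla\Phi_t^*(\mathbf{u}) = A_t \mathbf{u} - \sum_{s=1}^{t} y_s \mathbf{x}_s.$$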


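Before proceeding with the argument, we need to check that $\Phi_t^*$ is Legendre. Since $\Phi_t^*$
is defined on $\mathbb{R}^d$, and $\nabla\Phi_t^*$ computed above is continuous, we only need to verify that $\Phi_t^*$
is strictly convex. To see this, note that $A_t$ is the Hessian matrix of $\Phi_t^*$. Furthermore, $A_t$ is
positive definite for all $t = 0, 1, \ldots$ because

$$\mathbf{v}^{\top} A_t \mathbf{v} = \|\mathbf{v}\|^2 + \sum_{s=1}^{t} \bigl(\mathbf{x}_s^{\top}\mathbf{v}\bigr)^2 \ge 0$$

for any $\mathbf{v} \in \mathbb{R}^d$, and $\mathbf{v}^{\top} A_t \mathbf{v} = 0$ if and only if $\mathbf{v} = \mathbf{0}$. This implies that $\Phi_t^*$ is strictly convex.
Since $\Phi_t^*$ is Legendre, Lemma 11.5 applies, and we can invert $\nabla\Phi_t^*$ to obtain

$$\nabla\Phi_t(\mathbf{u}) = A_t^{-1}\Bigl( \mathbf{u} + \sum_{s=1}^{t} y_s \mathbf{x}_s \Bigr).$$

This last equation immediately yields a closed-form expression for $\mathbf{w}_t$:

$$\mathbf{w}_t = \nabla\Phi_t(\mathbf{0}) = A_t^{-1}\sum_{s=1}^{t} y_s \mathbf{x}_s.$$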


In this form, $\mathbf{w}_t$ can be recognized as the solution of

$$\operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ \frac{1}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{s=1}^{t} \bigl(\mathbf{u}^{\top}\mathbf{x}_s - y_s\bigr)^2 \Bigr] = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Phi_t^*(\mathbf{u}),$$

which defines the well-known ridge regression estimator of Hoerl and Kennard [162]. As
noted in Remark 11.1 of Section 11.6, this alternative nonrecursive definition of $\mathbf{w}_t$ is
possible under any time-varying potential. However, notwithstanding this equivalence, it
is the recursive definition of $\mathbf{w}_t$ that makes it easy to prove regret bounds as witnessed by
Theorem 11.6.

After this short digression, we now return to the analysis of the regret. Applying the
above formulas to $\mathbf{w}_t = \nabla\Phi_t\bigl(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})\bigr)$, and using a little algebra, we obtain
$\mathbf{w}_t = A_t^{-1}\bigl(A_{t-1}\mathbf{w}_{t-1} + y_t \mathbf{x}_t\bigr)$. We now state (proof left as exercise) a more explicit form for
the update rule. In the rest of this chapter, we abbreviate the name of the gradient-based
forecaster using the time-varying elliptic potential with the more concise “ridge regression
forecaster.”

The next lemma, used in the analysis of ridge regression, shows that the weight update
rule of this forecaster can be written as an instance of the Widrow-Hoff rule (mentioned in
Section 11.4) using a real matrix (instead of a scalar) as learning rate.


Lemma 11.10. The weights generated by the ridge regression forecaster satisfy, for each
$t \ge 1$,

$$\mathbf{w}_t = \mathbf{w}_{t-1} - A_t^{-1}\bigl(\mathbf{w}_{t-1}^{\top}\mathbf{x}_t - y_t\bigr)\mathbf{x}_t.$$
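The equivalence between the recursive update of Lemma 11.10 and the closed form $\mathbf{w}_t = A_t^{-1}\sum_{s \le t} y_s \mathbf{x}_s$ can be checked with a short script. The sketch below is illustrative: the data are synthetic, and $A_t^{-1}$ is maintained with the Sherman-Morrison formula, a standard identity for rank-one updates that the text does not use.

```python
import random

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def sherman_morrison(a_inv, x):
    """Inverse of (A + x x^T) given A^{-1}, for symmetric positive definite A."""
    ax = mat_vec(a_inv, x)
    denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
    size = len(x)
    return [[a_inv[i][j] - ax[i] * ax[j] / denom for j in range(size)]
            for i in range(size)]

random.seed(2)
d, n = 3, 40
a_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # A_0 = I
w = [0.0] * d                  # w_0 = 0
b = [0.0] * d                  # running sum of y_s x_s
max_gap = 0.0
for _ in range(n):
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = random.uniform(-1, 1)
    a_inv = sherman_morrison(a_inv, x)                    # A_t^{-1}
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y   # w_{t-1} . x_t - y_t
    step = mat_vec(a_inv, x)                              # A_t^{-1} x_t
    w = [wi - residual * si for wi, si in zip(w, step)]   # update of Lemma 11.10
    b = [bi + y * xi for bi, xi in zip(b, x)]
    closed_form = mat_vec(a_inv, b)                       # A_t^{-1} sum_s y_s x_s
    max_gap = max(max_gap, max(abs(p - q) for p, q in zip(w, closed_form)))
```

The largest componentwise gap between the two definitions, `max_gap`, stays at the level of floating-point rounding.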
Before proving the main result of this section, we need a further technical lemma. 


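Lemma 11.11. Let $B$ be an arbitrary $d \times d$ full-rank matrix, let $\mathbf{x}$ be an arbitrary vector, and
let $A = B + \mathbf{x}\mathbf{x}^{\top}$. Then

$$\mathbf{x}^{\top} A^{-1} \mathbf{x} = 1 - \frac{\det(B)}{\det(A)}.$$

Proof. If $\mathbf{x} = (0, \ldots, 0)$, then the lemma holds trivially. Otherwise, we write

$$B = A - \mathbf{x}\mathbf{x}^{\top} = A\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^{\top}\bigr).$$

Hence, computing the determinant of the leftmost and rightmost matrices,

$$\det(B) = \det(A)\, \det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^{\top}\bigr).$$

The right-hand side of this equation can be transformed as follows:

$$\begin{aligned}
\det(A)\, \det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^{\top}\bigr)
&= \det(A)\, \det\bigl(A^{1/2}\bigr)\, \det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^{\top}\bigr)\, \det\bigl(A^{-1/2}\bigr) \\
&= \det(A)\, \det\bigl(I - A^{-1/2}\mathbf{x}\mathbf{x}^{\top}A^{-1/2}\bigr).
\end{aligned}$$

Hence, we are left to show that $\det\bigl(I - A^{-1/2}\mathbf{x}\mathbf{x}^{\top}A^{-1/2}\bigr) = 1 - \mathbf{x}^{\top}A^{-1}\mathbf{x}$. Letting
$\mathbf{z} = A^{-1/2}\mathbf{x}$, this can be rewritten as $\det(I - \mathbf{z}\mathbf{z}^{\top}) = 1 - \mathbf{z}^{\top}\mathbf{z}$. It is easy to see that $\mathbf{z}$ is an
eigenvector of $I - \mathbf{z}\mathbf{z}^{\top}$ with eigenvalue $\lambda_1 = 1 - \mathbf{z}^{\top}\mathbf{z}$. Moreover, the remaining $d - 1$
eigenvectors $\mathbf{u}_2, \ldots, \mathbf{u}_d$ of $I - \mathbf{z}\mathbf{z}^{\top}$ form an orthogonal basis of the subspace of $\mathbb{R}^d$ orthogonal
to $\mathbf{z}$, and the corresponding eigenvalues $\lambda_2, \ldots, \lambda_d$ are all equal to 1. Hence,

$$\det\bigl(I - \mathbf{z}\mathbf{z}^{\top}\bigr) = \prod_{j=1}^{d} \lambda_j = 1 - \mathbf{z}^{\top}\mathbf{z},$$

which concludes the proof. □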
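The identity of Lemma 11.11 can be verified by hand on a small instance. The sketch below uses an assumed $2 \times 2$ symmetric positive definite matrix `b` and an assumed vector `x`, with explicit formulas for the determinant and inverse of a $2 \times 2$ matrix.

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

b = [[2.0, 0.5], [0.5, 1.0]]       # symmetric positive definite B
x = [1.5, -0.7]
a = [[b[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]   # A = B + x x^T
a_inv = inv2(a)

quadratic_form = sum(x[i] * a_inv[i][j] * x[j] for i in range(2) for j in range(2))
determinant_side = 1.0 - det2(b) / det2(a)
```

Both `quadratic_form` and `determinant_side` evaluate the same quantity, $\mathbf{x}^{\top}A^{-1}\mathbf{x}$, which lies strictly between 0 and 1 for a positive definite $B$ and nonzero $\mathbf{x}$.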


We are now ready to prove a bound on the regret for the square loss of the ridge regression 
forecaster. 


Theorem 11.7. If the ridge regression forecaster is run on a sequence
$(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then, for all $u \in \mathbb{R}^d$ and for all $n \ge 1$, the regret
$R_n(u)$ defined in terms of the square loss satisfies

$$ R_n(u) \le \frac12 \|u\|^2 + \Bigl(\sum_{i=1}^{d} \ln(1 + \lambda_i)\Bigr) \max_{t=1,\ldots,n} \ell_t(w_{t-1}), $$

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$.

Proof. Note that $\Phi_0^*(u) = \frac12\|u\|^2$, so that $\nabla\Phi_0^*(0) = 0$. Hence, Theorem 11.6 can be
applied. Using the nonnegativity of Bregman divergences, we get

$$ R_n(u) \le D_{\Phi_0^*}(u, w_0) + \sum_{t=1}^{n} D_{\Phi_t^*}(w_{t-1}, w_t), $$

where $D_{\Phi_0^*}(u, w_0) = \frac12\|u\|^2$ because $w_0 = 0$.
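Using Lemmas 11.9 and 11.10, we can write

$$ \begin{aligned}
D_{\Phi_t^*}(w_{t-1}, w_t)
  &= \frac12 (w_{t-1} - w_t)^\top A_t\, (w_{t-1} - w_t) \\
  &= \frac12 (w_{t-1} - w_t)^\top (w_{t-1}^\top x_t - y_t)\, x_t \\
  &= \frac12 (w_{t-1}^\top x_t - y_t)^2\, x_t^\top A_t^{-1} x_t \\
  &= \ell_t(w_{t-1})\, x_t^\top A_t^{-1} x_t .
\end{aligned} $$
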


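Applying Lemma 11.11, we get

$$ \begin{aligned}
\sum_{t=1}^{n} x_t^\top A_t^{-1} x_t
  &= \sum_{t=1}^{n} \Bigl(1 - \frac{\det(A_{t-1})}{\det(A_t)}\Bigr) \\
  &\le \sum_{t=1}^{n} \ln\frac{\det(A_t)}{\det(A_{t-1})}
     \qquad \text{(because $1 - x \le -\ln x$ for all $x > 0$)} \\
  &= \ln\frac{\det(A_n)}{\det(A_0)} \\
  &= \sum_{i=1}^{d} \ln(1 + \lambda_i),
\end{aligned} $$

where the last equality holds because $\det(A_0) = \det(I) = 1$ and because
$\det(A_n) = (1 + \lambda_1) \times \cdots \times (1 + \lambda_d)$, where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the
$d \times d$ matrix $A_n - I = x_1 x_1^\top + \cdots + x_n x_n^\top$. Hence,

$$ \sum_{t=1}^{n} \ell_t(w_{t-1})\, x_t^\top A_t^{-1} x_t
   \le \Bigl(\sum_{i=1}^{d} \ln(1 + \lambda_i)\Bigr) \max_{t=1,\ldots,n} \ell_t(w_{t-1}); $$

this concludes the proof. □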


Theorem 11.7 is somewhat disappointing because we do not know how to control the
term $\max_t \ell_t(w_{t-1})$. If this term were bounded by a constant (which is certainly the case if the
pairs $(x_t, y_t)$ come from a bounded subset of $\mathbb{R}^d \times \mathbb{R}$), then the regret would be bounded by
$O(\ln n)$, an exponential improvement over the forecasters using fixed potentials! To see this,
choose $X$ such that $\|x_t\| \le X$ for all $t = 1, \ldots, n$. A basic algebraic fact states that $A_n - I$
has the same nonzero eigenvalues as the matrix $G$ with entries $G_{i,j} = x_i^\top x_j$ ($G$ is called the
Gram matrix of the points $x_1, \ldots, x_n$). Therefore
$\lambda_1 + \cdots + \lambda_d = x_1^\top x_1 + \cdots + x_n^\top x_n \le n X^2$. The quantity
$(1 + \lambda_1) \times \cdots \times (1 + \lambda_d)$, under the constraint $\lambda_1 + \cdots + \lambda_d \le n X^2$,
is maximized when $\lambda_i = n X^2 / d$ for each $i$. This gives

$$ \sum_{i=1}^{d} \ln(1 + \lambda_i) \le d \ln\Bigl(1 + \frac{n X^2}{d}\Bigr). $$
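
The ridge regression forecaster and the quantities in Theorem 11.7 can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy; the function name `ridge_forecaster`, the synthetic data, and the reference vector `u_ref` are hypothetical choices made for the example. The update inside the loop is the one of Lemma 11.10, and the square loss is taken as $\ell_t(w) = \frac12(w^\top x_t - y_t)^2$.

```python
import numpy as np

def ridge_forecaster(xs, ys):
    """Run the ridge regression forecaster; return predictions, w_n and A_n."""
    n, d = xs.shape
    A = np.eye(d)                 # A_0 = I
    w = np.zeros(d)               # w_0 = 0
    preds = np.empty(n)
    for t in range(n):
        x, y = xs[t], ys[t]
        preds[t] = w @ x          # prediction uses w_{t-1}
        A += np.outer(x, x)       # A_t = A_{t-1} + x_t x_t^T
        # Lemma 11.10: w_t = w_{t-1} - A_t^{-1} (w_{t-1}^T x_t - y_t) x_t
        w = w - np.linalg.solve(A, x) * (w @ x - y)
    return preds, w, A

rng = np.random.default_rng(1)
n, d = 200, 3
xs = rng.uniform(-1.0, 1.0, size=(n, d))
u_ref = np.array([0.5, -1.0, 0.25])           # arbitrary reference expert
ys = xs @ u_ref + 0.1 * rng.standard_normal(n)

preds, w_n, A_n = ridge_forecaster(xs, ys)
closed_form = np.linalg.solve(A_n, xs.T @ ys)  # A_n^{-1} sum_s y_s x_s

losses = 0.5 * (preds - ys) ** 2               # l_t(w_{t-1})
regret = losses.sum() - 0.5 * np.sum((xs @ u_ref - ys) ** 2)
log_det = np.linalg.slogdet(A_n)[1]            # equals sum_i ln(1 + lambda_i)
X2 = np.max(np.sum(xs ** 2, axis=1))           # X^2 = max_t ||x_t||^2
bound_11_7 = 0.5 * u_ref @ u_ref + log_det * losses.max()
```

On such data one can compare `regret` with `bound_11_7`, and `log_det` with `d * np.log(1 + n * X2 / d)`.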


11.8 A Nonlinear Forecaster 


In this section we show a variant of the ridge regression forecaster achieving a logarithmic 
regret bound with an improved leading constant. In Section 11.9 we show that this constant 
is optimal. 

We start from the nonrecursive definition, given in Section 11.7, of the gradient-based
forecaster using the time-varying elliptic potential (i.e., the ridge regression forecaster)

$$ w_{t-1} = \operatorname*{argmin}_{u \in \mathbb{R}^d}
   \Bigl[\frac12\|u\|^2 + \frac12 \sum_{s=1}^{t-1} (u^\top x_s - y_s)^2\Bigr]
   = \operatorname*{argmin}_{u \in \mathbb{R}^d} \Phi_{t-1}^*(u). $$


We now present the Vovk–Azoury–Warmuth forecaster, introduced by Vovk [300] as an
extension to linear experts of his aggregating forecaster (see Sections 3.5 and 11.10), and
also studied by Azoury and Warmuth [20] as a special case of a different algorithm. The
Vovk–Azoury–Warmuth forecaster predicts at time $t$ with $\hat w_t^\top x_t$, where

$$ \hat w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d}
   \Bigl[\frac12\|u\|^2 + \frac12 \sum_{s=1}^{t-1} (u^\top x_s - y_s)^2 + \frac12 (u^\top x_t)^2\Bigr]. $$

Note that now the weight $\hat w_t$ used to predict on $x_t$ has index $t$ rather than $t - 1$. We use this,
along with the “hat” notation, to stress the fact that now $\hat w_t$ does depend on $x_t$. Note that
this makes the Vovk–Azoury–Warmuth forecaster nonlinear.

Comparing this choice of $\hat w_t$ with the corresponding choice of $w_{t-1}$ for the gradient-based
forecaster, one can note that we simply added the term $\frac12 (u^\top x_t)^2$. This additional
term can be viewed as the loss $\frac12 (u^\top x_t - y_t)^2$, where the outcome $y_t$, unavailable when $\hat w_t$
is computed, has been “estimated” by 0.

As we did in Section 11.7, it is easy to derive an explicit form for this new $\hat w_t$ also:

$$ \hat w_t = A_t^{-1} \sum_{s=1}^{t-1} y_s x_s . $$


We now prove a logarithmic bound on the regret for the square loss of the
Vovk–Azoury–Warmuth forecaster. Though its proof heavily relies on the techniques developed for proving
Theorems 11.6 and 11.7, it is not clear how to derive the bound directly as a corollary of
those results. On the other hand, the same regret bound can be obtained (see the proof
in Azoury and Warmuth [20]) using the gradient-based linear forecaster with a modified
time-varying potential, and then adapting the proof of Theorem 11.6 in Section 11.6. We
do not follow that route in order to keep the proof as simple as possible.

Theorem 11.8. If the Vovk–Azoury–Warmuth forecaster is run on a sequence
$(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then, for all $u \in \mathbb{R}^d$ and for all $n \ge 1$,

$$ R_n(u) \le \frac12\|u\|^2 + \frac{Y^2}{2} \sum_{i=1}^{d} \ln(1 + \lambda_i)
          \le \frac12\|u\|^2 + \frac{d\,Y^2}{2} \ln\Bigl(1 + \frac{n X^2}{d}\Bigr), $$

where $X = \max_{t=1,\ldots,n} \|x_t\|$, $Y = \max_{t=1,\ldots,n} |y_t|$, and $\lambda_1, \ldots, \lambda_d$ are the
eigenvalues of the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$.


Proof. For convenience, we introduce the shorthand

$$ a_t = \sum_{s=1}^{t} y_s x_s . $$

In what follows, we use $\hat w_t = A_t^{-1} a_{t-1}$ to denote the weight, at time $t$, of the
Vovk–Azoury–Warmuth forecaster, and $w_{t-1} = A_{t-1}^{-1} a_{t-1}$ to denote the weight, at time $t$, of the
ridge regression forecaster. A key step in the proof is the observation that, for all $u \in \mathbb{R}^d$,

$$ L_n(u) \ge \inf_{v \in \mathbb{R}^d} \Phi_n^*(v) - \frac12\|u\|^2 = \Phi_n^*(w_n) - \frac12\|u\|^2 , $$

where we recall that $L_n(u) = \ell_1(u) + \cdots + \ell_n(u)$. Hence, the loss of an arbitrary linear
expert is simply bounded by the potential of the weight of the ridge regression forecaster.
This can be exploited as follows:


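$$ \begin{aligned}
R_n(u) &= \hat L_n - L_n(u) \\
  &\le \hat L_n + \frac12\|u\|^2 - \Phi_n^*(w_n) \\
  &= \frac12\|u\|^2 + \sum_{t=1}^{n} \bigl(\ell_t(\hat w_t) + \Phi_{t-1}^*(w_{t-1}) - \Phi_t^*(w_t)\bigr) \\
  &= \frac12\|u\|^2 + \sum_{t=1}^{n} \bigl(\ell_t(\hat w_t) - \ell_t(w_{t-1})\bigr)
     + \sum_{t=1}^{n} D_{\Phi_t^*}(w_{t-1}, w_t),
\end{aligned} $$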


where in the last step we used the equality

$$ \ell_t(w_{t-1}) = D_{\Phi_t^*}(w_{t-1}, w_t) + \Phi_t^*(w_t) - \Phi_{t-1}^*(w_{t-1}) $$

established at the beginning of the proof of Theorem 11.6. Note that, apart from the term
$\frac12\|u\|^2$, we upper bounded the regret $R_n(u)$ with the difference

$$ \sum_{t=1}^{n} \bigl(\ell_t(\hat w_t) - \ell_t(w_{t-1})\bigr) $$

between the Vovk–Azoury–Warmuth forecaster and the ridge regression forecaster plus a
sum of divergence terms.

Now, the identities $A_t - A_{t-1} = x_t x_t^\top$,
$A_{t-1}^{-1} - A_t^{-1} = A_{t-1}^{-1} x_t x_t^\top A_t^{-1}$, and
$A_{t-1}^{-1} - A_t^{-1} = A_t^{-1} x_t x_t^\top A_{t-1}^{-1}$ together imply that

$$ A_{t-1}^{-1} - A_t^{-1} = A_t^{-1} x_t x_t^\top A_t^{-1}
   + (x_t^\top A_{t-1}^{-1} x_t)\, A_t^{-1} x_t x_t^\top A_t^{-1} . $$

Using this and the definition of the time-varying elliptic potential (see Section 11.7),

$$ \Phi_t^*(w_t) = \frac12\, w_t^\top A_t w_t - w_t^\top a_t + \frac12 \sum_{s=1}^{t} y_s^2 , $$

we prove that

$$ \begin{aligned}
\ell_t(\hat w_t) - \ell_t(w_{t-1}) + D_{\Phi_t^*}(w_{t-1}, w_t)
  &= \frac{y_t^2}{2}\, x_t^\top A_t^{-1} x_t
     - \frac12\, (x_t^\top A_{t-1}^{-1} x_t)\,(\hat w_t^\top x_t)^2 \\
  &\le \frac{y_t^2}{2}\, x_t^\top A_t^{-1} x_t ,
\end{aligned} $$

where the term dropped in the last step is nonpositive (recall that $A_{t-1}$ is positive definite, implying
that $A_{t-1}^{-1}$ also is positive definite). Using Lemma 11.11 then gives the desired result. □
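
The forecaster and the first bound of Theorem 11.8 can be illustrated as follows. This is a minimal sketch, assuming NumPy; `vaw_predictions`, the synthetic data, and the reference vector `u_ref` are hypothetical names and choices for illustration. The only difference from ridge regression is that $A_t$ already includes $x_t x_t^\top$ when the prediction for round $t$ is made.

```python
import numpy as np

def vaw_predictions(xs, ys):
    """Predictions hat w_t^T x_t with hat w_t = A_t^{-1} a_{t-1}; also returns A_n."""
    n, d = xs.shape
    A = np.eye(d)                 # A_0 = I
    a = np.zeros(d)               # a_{t-1} = sum_{s<t} y_s x_s
    preds = np.empty(n)
    for t in range(n):
        x = xs[t]
        A += np.outer(x, x)       # A_t is formed before y_t is revealed
        preds[t] = np.linalg.solve(A, a) @ x
        a += ys[t] * x            # now a holds a_t
    return preds, A

rng = np.random.default_rng(2)
n, d = 300, 4
xs = rng.uniform(-1.0, 1.0, size=(n, d))
u_ref = rng.standard_normal(d)                     # arbitrary reference expert
ys = np.clip(xs @ u_ref + 0.2 * rng.standard_normal(n), -2.0, 2.0)

preds, A_n = vaw_predictions(xs, ys)
regret = 0.5 * np.sum((preds - ys) ** 2) - 0.5 * np.sum((xs @ u_ref - ys) ** 2)
Y = np.max(np.abs(ys))
bound_11_8 = 0.5 * u_ref @ u_ref + 0.5 * Y ** 2 * np.linalg.slogdet(A_n)[1]
```

Unlike the bound of Theorem 11.7, `bound_11_8` does not involve the losses of the forecaster itself, only the range $Y$ of the outcomes.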


As a final remark, note that, using the Sherman–Morrison formula,

$$ A_t^{-1} = A_{t-1}^{-1}
   - \frac{(A_{t-1}^{-1} x_t)(A_{t-1}^{-1} x_t)^\top}{1 + x_t^\top A_{t-1}^{-1} x_t} , $$

the $d \times d$ matrix $A_t^{-1}$, where $A_t = A_{t-1} + x_t x_t^\top$, can be computed from $A_{t-1}^{-1}$ in time
$\Theta(d^2)$. This is much better than the time $\Theta(d^3)$ required by a direct inversion of $A_t$. Hence, both the
ridge regression update and the Vovk–Azoury–Warmuth update can be computed in time
$\Theta(d^2)$. In contrast, the time needed to compute the forecasters based on fixed potentials is
only $\Theta(d)$.
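
The rank-one update above translates directly into code. The following sketch assumes NumPy and a symmetric matrix $A_{t-1}$ (as is the case here); the function name `sherman_morrison_update` is an illustrative choice.

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return (A + x x^T)^{-1} from A^{-1} (A symmetric) using O(d^2) operations."""
    v = A_inv @ x                                   # A^{-1} x, a matrix-vector product
    return A_inv - np.outer(v, v) / (1.0 + x @ v)   # rank-one correction

rng = np.random.default_rng(3)
d = 6
M = rng.standard_normal((d, d))
A_prev = np.eye(d) + M @ M.T                        # plays the role of A_{t-1}
x_t = rng.standard_normal(d)

fast = sherman_morrison_update(np.linalg.inv(A_prev), x_t)
direct = np.linalg.inv(A_prev + np.outer(x_t, x_t)) # cubic-time reference
```

Maintaining the inverse in this way avoids calling a linear solver in every round.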


11.9 Lower Bounds 


We now prove that the Vovk—Azoury—Warmuth forecaster is optimal in the sense that its 
leading constant cannot be decreased. 


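Theorem 11.9. Let $\ell$ be the square loss $\ell(p, y) = \frac12 (p - y)^2$. For all $d \ge 1$, for all $Y > 0$,
for all $\varepsilon > 0$, and for any forecaster, there exists a sequence
$(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times [-Y, Y]$, with $\|x_t\| = 1$ for $t = 1, 2, \ldots$, such that

$$ \liminf_{n \to \infty}
   \frac{\hat L_n - \inf_{u \in \mathbb{R}^d} \bigl(L_n(u) + \|u\|^2\bigr)}{\ln n}
   \ge (d - \varepsilon)\, \frac{Y^2}{2} . $$
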

Proof. We only prove the theorem in the case $d = 1$; the easy generalization to an arbitrary
$d \ge 1$ is left as an exercise. Without loss of generality, set $Y = 1/2$ (the bound can be rescaled
to any range). Let $x_t = 1$ for $t = 1, 2, \ldots$, so that all losses are of the form $\frac12 (w - y)^2$.
Since this value does not change if the same constant is added to $w$ and $y$, without loss
of generality we may assume that $y_t \in \{0, 1\}$ for all $t$ (instead of $y_t \in \{-1/2, 1/2\}$, as
suggested by the assumption $Y = 1/2$).

Let $L(F, y^n)$ be the cumulative square loss of forecaster $F$ on the sequence
$(1, y_1), \ldots, (1, y_n)$ and let $L(u, y^n)$ be the cumulative square loss of expert $u$ on the same
sequence. Then

$$ \inf_F \max_{y^n} \Bigl(L(F, y^n) - \inf_u \bigl(L(u, y^n) + u^2\bigr)\Bigr)
   \ge \inf_F \mathbb{E}\Bigl[L(F, Y_1, \ldots, Y_n) - \inf_u \bigl(L(u, Y_1, \ldots, Y_n) + u^2\bigr)\Bigr], $$


where the expectation is taken with respect to a probability distribution on $\{0, 1\}^n$ defined
as follows: first, $Z \in [0, 1]$ is drawn from the Beta distribution with parameters $(a, a)$,
where $a > 1$ is specified later. Then each $Y_t$ is drawn independently from a Bernoulli
distribution of parameter $Z$. It is easy to show that the forecaster $F$ achieving the infimum
of $\mathbb{E}\, L(F, Y_1, \ldots, Y_n)$ predicts at time $t + 1$ by minimizing the expected loss on the next
outcome $Y_{t+1}$ conditioned on the realizations of the previous outcomes $Y_1, \ldots, Y_t$. The
prediction $\hat p_{t+1}$ minimizing the expected square loss on $Y_{t+1}$ is simply the expected value
of $Y_{t+1}$ conditioned on the number of times $S_t = Y_1 + \cdots + Y_t$ the outcome 1 showed up
in the past. Using simple properties of the Beta distribution (see Section A.1.9), we find
that the conditional density $f_Z(p \mid S_t = k)$ of $Z$, given the event $S_t = k$, equals

$$ f_Z(p \mid S_t = k) = \frac{p^{a+k-1} (1 - p)^{a+t-k-1}}{B(a + k,\, a + t - k)} , $$

where $B(x, y) = \Gamma(x)\Gamma(y) / \Gamma(x + y)$ is the Beta function. Therefore,

$$ \begin{aligned}
\hat p_{t+1} = \mathbb{E}[Y_{t+1} \mid S_t = k]
  &= \int_0^1 \mathbb{E}[Y_{t+1} \mid Z = p]\, f_Z(p \mid S_t = k)\, dp \\
  &= \int_0^1 \frac{p^{a+k} (1 - p)^{a+t-k-1}}{B(a + k,\, a + t - k)}\, dp \\
  &= \frac{k + a}{t + 2a} .
\end{aligned} $$
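
The closed form $(k + a)/(t + 2a)$ can be checked exactly for integer $a$, since the last integral equals $B(a + k + 1, a + t - k)$ and the Beta function of positive integers is a ratio of factorials. The sketch below uses only the Python standard library; the helper names `beta_fn` and `posterior_mean` are illustrative.

```python
from fractions import Fraction
from math import factorial

def beta_fn(x, y):
    """Beta function B(x, y) for positive integers x, y, as an exact fraction."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))

def posterior_mean(a, t, k):
    """E[Y_{t+1} | S_t = k] = B(a + k + 1, a + t - k) / B(a + k, a + t - k)."""
    return beta_fn(a + k + 1, a + t - k) / beta_fn(a + k, a + t - k)

# Compare with the closed form (k + a) / (t + 2a) on a small range of parameters.
all_match = all(
    posterior_mean(a, t, k) == Fraction(k + a, t + 2 * a)
    for a in range(2, 5)
    for t in range(0, 7)
    for k in range(0, t + 1)
)
```

The optimal forecaster is thus an "add-$a$" smoothing of the empirical frequency $k/t$.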


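Let $\mathbb{E}_p$ be the conditional expectation $\mathbb{E}[\,\cdot \mid Z = p]$. Then,

$$ \begin{aligned}
\mathbb{E}_p\bigl[(\hat p_{t+1} - Y_{t+1})^2\bigr]
  &= \mathbb{E}_p\bigl[(\hat p_{t+1} - p)^2\bigr] + \mathbb{E}_p\bigl[(Y_{t+1} - p)^2\bigr] \\
  &= \mathbb{E}_p\Bigl[\Bigl(\frac{S_t + a}{t + 2a} - p\Bigr)^2\Bigr] + p(1 - p) \\
  &= \mathbb{E}_p\Bigl[\Bigl(\frac{S_t + a}{t + 2a} - \frac{pt + a}{t + 2a}
       + \frac{pt + a}{t + 2a} - p\Bigr)^2\Bigr] + p(1 - p) \\
  &= \mathbb{E}_p\Bigl[\Bigl(\frac{S_t - pt}{t + 2a}\Bigr)^2\Bigr]
       + \Bigl(\frac{2ap - a}{t + 2a}\Bigr)^2 + p(1 - p)
       \qquad \text{(since $\mathbb{E}_p S_t = pt$)} \\
  &= \frac{t\,p(1 - p)}{(t + 2a)^2} + \Bigl(\frac{2ap - a}{t + 2a}\Bigr)^2 + p(1 - p).
\end{aligned} $$

Hence, recalling that $a > 1$, the expected cumulative loss of the optimal forecaster $F$ is
computed as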


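$$ \begin{aligned}
\mathbb{E}\, L(F, Y_1, \ldots, Y_n)
  &= \frac12\, \mathbb{E}[Z(1 - Z)] \sum_{t=0}^{n-1} \frac{t}{(t + 2a)^2}
   + \frac12\, \mathbb{E}\bigl[(2aZ - a)^2\bigr] \sum_{t=0}^{n-1} \frac{1}{(t + 2a)^2}
   + \frac{n}{2}\, \mathbb{E}[Z(1 - Z)] \\
  &\ge \frac12\, \mathbb{E}[Z(1 - Z)] \int_1^{n-1} \frac{t\, dt}{(t + 2a)^2}
   + \frac12\, \mathbb{E}\bigl[(2aZ - a)^2\bigr] \sum_{t=0}^{n-1} \frac{1}{(t + 2a)^2}
   + \frac{n}{2}\, \mathbb{E}[Z(1 - Z)].
\end{aligned} $$

We now lower bound the three terms on the right-hand side. The integral in the first term
equals

$$ \ln\frac{n - 1 + 2a}{1 + 2a} - \frac{2a(n - 2)}{(1 + 2a)(n - 1 + 2a)}
   = \ln\frac{n - 1 + 2a}{1 + 2a} + \Theta(1), $$

where, here and in what follows, $\Theta(1)$ is understood for $a$ constant and $n \to \infty$. As the
entire second term is $\Theta(1)$, using $\mathbb{E}[Z(1 - Z)] = a/(4a + 2)$, we get

$$ \mathbb{E}\, L(F, Y_1, \ldots, Y_n)
   \ge \frac{a}{4(2a + 1)} \ln\frac{n - 1 + 2a}{1 + 2a} + \frac{an}{4(2a + 1)} + \Theta(1). $$

Now we compute the cumulative loss of the best expert. Recalling that $S_n = Y_1 + \cdots + Y_n$,
we have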

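$$ \begin{aligned}
\mathbb{E}_p\Bigl[\inf_u \sum_{t=1}^{n} (u - Y_t)^2\Bigr]
  &= \mathbb{E}_p\Bigl[\sum_{t=1}^{n} \Bigl(\frac{S_n}{n} - Y_t\Bigr)^2\Bigr] \\
  &= \mathbb{E}_p\Bigl[\sum_{t=1}^{n} Y_t^2\Bigr] - \frac{2}{n}\, \mathbb{E}_p\bigl[S_n^2\bigr]
     + \frac{1}{n}\, \mathbb{E}_p\bigl[S_n^2\bigr].
\end{aligned} $$

Hence, recalling that the variance of any random variable $X$ is $\mathbb{E} X^2 - (\mathbb{E} X)^2$, and that the
variance of the Binomial random variable $S_n$ is $np(1 - p)$, we get

$$ \begin{aligned}
\mathbb{E}_p\Bigl[\inf_u \sum_{t=1}^{n} (u - Y_t)^2\Bigr]
  &= np - \frac{2}{n}\bigl(np(1 - p) + (np)^2\bigr) + \frac{1}{n}\bigl(np(1 - p) + (np)^2\bigr) \\
  &= np(1 - p) - p(1 - p).
\end{aligned} $$

Integrating over $p$ yields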


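$$ \mathbb{E}\Bigl[\inf_u \frac12 \sum_{t=1}^{n} (u - Y_t)^2\Bigr]
   = \frac12 (n - 1)\, \mathbb{E}[Z(1 - Z)]
   = \frac{an}{4(2a + 1)} - \frac{a}{4(2a + 1)} . $$

Therefore, as the best expert $u = S_n / n$ satisfies $0 \le u \le 1$ (so that the term $u^2$ contributes
at most a constant),

$$ \begin{aligned}
\mathbb{E}\Bigl[L(F, Y_1, \ldots, Y_n) - \inf_u \bigl(L(u, Y_1, \ldots, Y_n) + u^2\bigr)\Bigr]
  &\ge \frac{a}{4(2a + 1)} \ln\frac{n - 1 + 2a}{1 + 2a}
     + \frac{an}{4(2a + 1)} - \frac{an}{4(2a + 1)} + \Theta(1) \\
  &= \Bigl(1 - \frac{1}{2a + 1}\Bigr) \frac18 \ln\frac{n - 1 + 2a}{1 + 2a} + \Theta(1).
\end{aligned} $$

We conclude the proof by observing that the factor multiplying $\ln n$ can be made arbitrarily
close to $1/8$ by choosing $a$ large enough. □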


11.10 Mixture Forecasters 


We have seen that in the case of linear experts, regret bounds growing logarithmically with 
time are obtainable for the squared loss function. The purpose of this section is to show that 
by a natural extension of the mixture forecasters introduced in Chapter 9, one may obtain a 
class of algorithms that achieve logarithmic regret bounds under general conditions for the 
logarithmic loss function. 

We start by describing a general framework for linear prediction under the logarithmic
loss function. The basic setup is reminiscent of sequential probability assignment introduced
in Chapter 9. The model is extended to allow side information and experts that
depend on “squashed” linear functions of the side-information vector. To harmonize notation
with Chapter 9, consider the following model. At each time instance $t$, before making
a prediction, the forecaster observes the side-information vector $x_t$ such that $\|x_t\| \le 1$. The
forecaster, based on the side-information vector and the past, assigns a nonnegative number
$\hat p_t(y, x_t)$ to each element $y$ of the outcome space $\mathcal{Y}$. Note that, as opposed to the rest of the
section, we do not require that $\mathcal{Y}$ be a subset of the real line. The function $\hat p_t(\cdot, x_t)$ is sometimes
interpreted as a “density” over $\mathcal{Y}$, though $\mathcal{Y}$ does not even need to be a measurable
space. The loss at time $t$ of the forecaster is defined by the logarithmic loss $-\ln \hat p_t(y_t, x_t)$,
and the corresponding cumulative loss is

$$ \hat L_n = -\ln \prod_{t=1}^{n} \hat p_t(y_t, x_t). $$


Just as in earlier sections in this chapter, each expert is indexed by a vector $u \in \mathbb{R}^d$, and
its prediction depends on the inner product of $u$ and the side-information vector through
a transfer function. Here the transfer function $\sigma : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is nonnegative, and the
prediction of expert $u$, on observing the side information $x_t$, is the “density” $\sigma(\cdot, u \cdot x_t)$.
The cumulative loss of expert $u$ is thus

$$ L_n(u) = -\ln \prod_{t=1}^{n} \sigma(y_t, u \cdot x_t). $$

In this section we consider mixture forecasters of the form

$$ \hat p_t(y, x_t) = \int \sigma(y, u \cdot x_t)\, q_t(u)\, du, $$

where $q_t$ is a density function (i.e., a nonnegative function with integral 1) over $\mathbb{R}^d$ defined,
for $t = 1, 2, \ldots$, by

$$ q_t(u) = \frac{q_0(u)\, e^{-L_{t-1}(u)}}{\int q_0(v)\, e^{-L_{t-1}(v)}\, dv}
         = \frac{q_0(u) \prod_{s=1}^{t-1} \sigma(y_s, u \cdot x_s)}
                {\int q_0(v) \prod_{s=1}^{t-1} \sigma(y_s, v \cdot x_s)\, dv} , $$

where $q_0$ denotes a fixed initial density. Thus, $\hat p_t$ is the prediction of the exponentially
weighted average forecaster run with initial weights given by the “prior” density $q_0$.
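
To make the definition concrete, here is a small sketch of a mixture forecaster in dimension $d = 1$. It is only an approximation under stated assumptions: the integral over $u$ is replaced by a sum over a finite grid (so the "prior" is a discretized gaussian), the logistic transfer function $\sigma(y, z) = 1/(1 + e^{-yz})$ with $y \in \{-1, 1\}$ is used, and all names (`logistic_transfer`, `mixture_forecaster`, the grid and the data) are hypothetical choices for the example.

```python
import numpy as np

def logistic_transfer(y, z):
    """sigma(y, z) = 1 / (1 + exp(-y z)) for y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * z))

def mixture_forecaster(xs, ys, grid):
    """Cumulative log loss of the grid-discretized mixture forecaster (d = 1)."""
    q = np.exp(-0.5 * grid ** 2)
    q /= q.sum()                                # discretized gaussian prior q_0
    q0 = q.copy()
    cum_loss = 0.0
    for x, y in zip(xs, ys):
        lik = logistic_transfer(y, grid * x)    # sigma(y_t, u x_t) at each grid point u
        p = q @ lik                             # mixture prediction p_t(y_t, x_t)
        cum_loss -= np.log(p)                   # logarithmic loss of the forecaster
        q = q * lik / p                         # reweight: q_{t+1} is prop. to q_t * sigma
    return cum_loss, q0

rng = np.random.default_rng(4)
n = 50
xs = rng.uniform(-1.0, 1.0, size=n)
ys = np.where(rng.uniform(size=n) < logistic_transfer(1.0, 1.5 * xs), 1.0, -1.0)
grid = np.linspace(-6.0, 6.0, 241)

cum_loss, q0 = mixture_forecaster(xs, ys, grid)
# Telescoping: the product of the predictions equals the prior-averaged likelihood.
evidence = q0 @ np.prod(logistic_transfer(ys[None, :], grid[:, None] * xs[None, :]), axis=1)
```

On the grid, the cumulative loss equals $-\ln$ of the prior-weighted average of the experts' likelihoods, which is the discrete analog of the identity used in the proof of Theorem 11.10.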


Example 11.8 (Square loss). Assume now that $\mathcal{Y}$ is a subset of the real line. By considering
the “gaussian” transfer function

$$ \sigma(y, u \cdot x) = \frac{1}{\sqrt{2\pi}}\, e^{-(u \cdot x - y)^2 / 2} , $$

the logarithmic loss of expert $u$ becomes $\ln\sqrt{2\pi} + \frac12 (u \cdot x_t - y_t)^2$, which is basically
the square loss studied in earlier sections. However, the logarithmic loss of the mixture
forecaster $\hat p_t(y_t, x_t)$ does not always correspond to the squared loss of any vector-valued
forecaster $w_t$. If the initial density $q_0$ is the multivariate gaussian density with identity
covariance matrix, then it is possible to modify the mixture predictor such that it becomes
equivalent to the Vovk–Azoury–Warmuth forecaster (see Exercise 11.18).


The main result of this section is the following general performance bound. 


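Theorem 11.10. Assume that the transfer function $\sigma$ is such that, for each fixed $y \in \mathcal{Y}$, the
function $F_y(z) = -\ln \sigma(y, z)$, defined for $z \in \mathbb{R}$, is twice continuously differentiable, and
there exists a constant $c$ such that $|F_y''(z)| \le c$ for all $z \in \mathbb{R}$. Let $u \in \mathbb{R}^d$, $\varepsilon > 0$, and let $q_u^\varepsilon$
be any density over $\mathbb{R}^d$ with mean $\int v\, q_u^\varepsilon(v)\, dv = u$ and covariance matrix $\varepsilon^2 I$. Then the
regret of the mixture forecaster defined above, with respect to expert $u$, is bounded as

$$ \hat L_n - L_n(u) \le \frac{n c\, \varepsilon^2}{2} + D(q_u^\varepsilon \,\|\, q_0), $$

where

$$ D(q_u^\varepsilon \,\|\, q_0) = \int q_u^\varepsilon(v) \ln\frac{q_u^\varepsilon(v)}{q_0(v)}\, dv $$

is the Kullback–Leibler divergence between $q_u^\varepsilon$ and the initial density $q_0$.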


Proof. Fix $u \in \mathbb{R}^d$, and let $q_u^\varepsilon$ be any density with mean $u$ and covariance matrix $\varepsilon^2 I$. In
the first step of the proof we relate the cumulative loss $L_n(u)$ to the averaged cumulative loss

$$ L_n(q_u^\varepsilon) = \int L_n(v)\, q_u^\varepsilon(v)\, dv. $$




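First observe that by Taylor's theorem and the condition on the transfer function, for any
$y \in \mathcal{Y}$ and $z, z_0 \in \mathbb{R}$,

$$ F_y(z) \le F_y(z_0) + F_y'(z_0)(z - z_0) + \frac{c}{2}(z - z_0)^2 . $$

Next we apply this inequality for $z = V \cdot x_t$ and $z_0 = \mathbb{E}[V] \cdot x_t$, where $V$ is a random
variable distributed according to the density $q_u^\varepsilon$. Noting that $z_0 = u \cdot x_t$, and taking expected
values on both sides, we have

$$ \mathbb{E}\, F_{y_t}(V \cdot x_t) \le F_{y_t}(u \cdot x_t) + \frac{c\, \varepsilon^2}{2} , $$

where we used the fact that
$\operatorname{var}(x_t \cdot V) = \sum_{i=1}^{d} \operatorname{var}(V_i)\, x_{t,i}^2 = \varepsilon^2 \sum_{i=1}^{d} x_{t,i}^2 \le \varepsilon^2$. Observing
that

$$ \sum_{t=1}^{n} F_{y_t}(u \cdot x_t) = L_n(u)
   \qquad \text{and} \qquad
   \sum_{t=1}^{n} \mathbb{E}\, F_{y_t}(V \cdot x_t) = L_n(q_u^\varepsilon), $$

we obtain

$$ L_n(q_u^\varepsilon) \le L_n(u) + \frac{n c\, \varepsilon^2}{2} . $$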

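To finish the proof, it remains to compare $\hat L_n$ with $L_n(q_u^\varepsilon)$. By definition of the mixture
forecaster,

$$ \begin{aligned}
\hat L_n - L_n(q_u^\varepsilon)
  &= -\ln \prod_{t=1}^{n} \hat p_t(y_t, x_t)
     + \int q_u^\varepsilon(v) \ln \prod_{t=1}^{n} \sigma(y_t, v \cdot x_t)\, dv \\
  &= \int q_u^\varepsilon(v) \ln\frac{\prod_{t=1}^{n} \sigma(y_t, v \cdot x_t)}
                                  {\prod_{t=1}^{n} \hat p_t(y_t, x_t)}\, dv \\
  &= \int q_u^\varepsilon(v) \ln\frac{\prod_{t=1}^{n} \sigma(y_t, v \cdot x_t)}
                                  {\int q_0(w) \prod_{t=1}^{n} \sigma(y_t, w \cdot x_t)\, dw}\, dv \\
  &= \int q_u^\varepsilon(v) \ln\frac{q_{n+1}(v)}{q_0(v)}\, dv
     \qquad \text{(by definition of $q_{n+1}$)} \\
  &= \int q_u^\varepsilon(v) \ln\frac{q_u^\varepsilon(v)}{q_0(v)}\, dv
     - \int q_u^\varepsilon(v) \ln\frac{q_u^\varepsilon(v)}{q_{n+1}(v)}\, dv \\
  &= D(q_u^\varepsilon \,\|\, q_0) - D(q_u^\varepsilon \,\|\, q_{n+1}) \\
  &\le D(q_u^\varepsilon \,\|\, q_0),
\end{aligned} $$

where at the last step we used the nonnegativity of the Kullback–Leibler divergence (see
Section A.2). This concludes the proof of the theorem. □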


Remark 11.3. The boundedness condition of the second derivative of the logarithm of
the transfer function is satisfied in several natural applications. For example, for the gaussian
transfer function $\sigma(y, z) = (1/\sqrt{2\pi})\, e^{-(z - y)^2 / 2}$, we have $F_y''(z) = 1$ for all $y, z \in \mathbb{R}$.
Another popular transfer function is the one used in logistic regression. In this case
$\mathcal{Y} = \{-1, 1\}$, and $\sigma$ is defined by $\sigma(y, u \cdot x) = 1/(1 + e^{-y\, u \cdot x})$. It is easy to see that
$|F_y''(z)| \le 1$ for both $y = -1, 1$.

Note that the definition of the mixture forecaster does not depend on the choice of $q_u^\varepsilon$, so
that we are free to choose this density to minimize the obtained upper bound. Minimization
of the Kullback–Leibler divergence given the variance constraint is a complex variational
problem, in general. However, useful upper bounds can easily be derived in various special
cases. Next we work out a specific example in which the variational problem can be solved
easily and $q_u^\varepsilon$ can be chosen optimally. See the exercises for other examples.


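Corollary 11.1. Assume that the mixture forecaster is used with the gaussian initial density
$q_0(u) = (2\pi)^{-d/2} e^{-\|u\|^2 / 2}$. If the transfer function satisfies the conditions of Theorem 11.10,
then for any $u \in \mathbb{R}^d$,

$$ \hat L_n - L_n(u) \le \frac{\|u\|^2}{2} + \frac{d}{2} \ln\Bigl(1 + \frac{nc}{d}\Bigr). $$

Proof. The Kullback–Leibler divergence between any density $q_u^\varepsilon$ with mean $u$ and
covariance matrix $\varepsilon^2 I$ and the initial density $q_0$ equals

$$ \begin{aligned}
D(q_u^\varepsilon \,\|\, q_0)
  &= \int q_u^\varepsilon(v) \ln q_u^\varepsilon(v)\, dv + \int q_u^\varepsilon(v) \ln\frac{1}{q_0(v)}\, dv \\
  &= \int q_u^\varepsilon(v) \ln q_u^\varepsilon(v)\, dv + \frac{d}{2}\ln(2\pi) + \frac{\|u\|^2}{2} + \frac{d\, \varepsilon^2}{2} .
\end{aligned} $$

The first term on the right-hand side is the negative of the differential entropy of
the density $q_u^\varepsilon(v)$. It is easy to see (Exercise 11.19) that among all densities with a given
covariance matrix, the differential entropy is maximized for the gaussian density. Therefore,
the best choice for $q_u^\varepsilon(v)$ is the multivariate normal density with mean $u$ and covariance
matrix $\varepsilon^2 I$. With this choice,

$$ \int q_u^\varepsilon(v) \ln q_u^\varepsilon(v)\, dv = -\frac{d}{2} \ln(2\pi e\, \varepsilon^2) $$

and the statement follows by choosing $\varepsilon$ to minimize the obtained bound. □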
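
The final minimization over $\varepsilon$ can be spelled out as follows. This is a worked sketch of the omitted step, using only the quantities defined above; the shorthand $s = \varepsilon^2$ and the name $g(s)$ are introduced here for convenience.

```latex
% Bound of Theorem 11.10 with the gaussian choice of q_u^\varepsilon, written in s = \varepsilon^2:
\[
  g(s) \;=\; \frac{n c\, s}{2}
       \;-\; \frac{d}{2}\ln(2\pi e\, s) \;+\; \frac{d}{2}\ln(2\pi)
       \;+\; \frac{\|u\|^2}{2} \;+\; \frac{d\, s}{2}
  \;=\; \frac{\|u\|^2}{2} \;+\; \frac{(nc + d)\, s}{2} \;-\; \frac{d}{2} \;-\; \frac{d}{2}\ln s .
\]
% g is convex in s > 0; setting the derivative to zero gives the minimizer:
\[
  g'(s) \;=\; \frac{nc + d}{2} - \frac{d}{2s} \;=\; 0
  \quad\Longleftrightarrow\quad
  s^\ast \;=\; \frac{d}{nc + d} .
\]
% Substituting s^* back:
\[
  g(s^\ast) \;=\; \frac{\|u\|^2}{2} + \frac{d}{2} - \frac{d}{2} + \frac{d}{2}\ln\frac{nc + d}{d}
  \;=\; \frac{\|u\|^2}{2} + \frac{d}{2}\ln\Bigl(1 + \frac{nc}{d}\Bigr).
\]
```

The optimal variance $\varepsilon^2 = d/(nc + d)$ shrinks like $1/n$, which is what turns the linear term $nc\,\varepsilon^2/2$ into a logarithmic regret.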


11.11 Bibliographic Remarks 


Sequential gradient descent can be viewed as an application of the well-known stochastic 
gradient descent procedure of Tsypkin [291] (see Bottou and Murata [39] for a survey) 
to a deterministic (rather than stochastic) data sequence. The gradient-based linear fore- 
caster was introduced with the name general additive regression algorithm by Warmuth 
and Jagota [305] (see also Kivinen and Warmuth [183]) for regression problems, and 
independently with the name quasi-additive classification algorithm by Grove, Littlestone, 
and Schuurmans [133] for classification problems. Potentials in pattern recognition have 
been introduced to describe, in a single unified framework, seemingly different algorithms 
such as the Widrow—Hoff rule [310] (weights updated additively) and the exponentiated 
gradient (EG) of Kivinen and Warmuth [181] (weights updated multiplicatively). The frame- 
work of potential functions enables one to view both algorithms as instances of a single 
algorithm, the gradient-based linear forecaster, whose weights are updated as in the dual
gradient update. In particular, the Widrow–Hoff rule corresponds to the gradient-based
linear forecaster applied to square loss and using the quadratic potential (Legendre polynomial
potential with $p = 2$), while the EG algorithm corresponds to the forecaster of
Theorem 11.3. As it has been observed by Grove, Littlestone, and Schuurmans [133], the 
polynomial potential provides a parameterized interpolation between genuinely additive 
algorithms and multiplicative algorithms. Earlier individual sequence analyses of additive 
and multiplicative algorithms for linear experts appear in Foster [103], Littlestone, Long, 
and Warmuth [202], Cesa-Bianchi, Long, and Warmuth [50], and Bylander [43]. For an
extensive discussion on the advantages of using polynomial vs. exponential potentials in 
regression problems, see Kivinen and Warmuth [181]. 

Gordon [131] develops an analysis of regret for more general problems than regression 
based on a generalized notion of Bregman divergence. 

The interpretation of the gradient-based update in terms of iterated minimization of 
a convex functional was suggested by Helmbold, Schapire, Singer, and Warmuth [157]. 
However, the connection with convex optimization is far from being accidental. An analog 
of the dual gradient update rule was introduced by Nemirovski and Yudin [223] under 
the name of mirror descent algorithm for the iterative solution of nonsmooth convex 
optimization problems. The description of this algorithm in the framework of Bregman 
divergences is due to Beck and Teboulle [24], who also propose a version of the algorithm 
based on the exponential potential. In the context of convex optimization, the iterated
minimization of the functional

$$ \min_{u \in \mathbb{R}^d} \bigl[D_{\Phi^*}(u, w_{t-1}) + \lambda\, \ell_t(u)\bigr] $$


(which we used to motivate the dual gradient update) corresponds to the well-known 
proximal point algorithm (see, e.g., Martinet [210] and Rockafellar [248]). A version of the 
proximal point algorithm based on the exponential potential was proposed by Tseng and 
Bertsekas [290]. 

Theorem 11.1 is due to Warmuth and Jagota [305]. The use of transfer functions in this 
context was pioneered by Helmbold, Kivinen, and Warmuth [153]. The good properties of 
subquadratic pairs of transfer and loss functions were observed by Cesa-Bianchi [46]. A dif- 
ferent approach, based on the notion “matching loss functions,” uses Bregman divergences 
to build nice pairs of transfer and loss functions and was investigated in Haussler, Kivinen, 
and Warmuth [153]. In spite of its elegance, the matching-loss approach is not discussed 
here because it does not fit very well with the proof techniques used in this chapter. 

The projected gradient-based forecaster of Section 11.5 has been introduced by Herbster 
and Warmuth [160]. They prove Theorem 11.4 about tracking linear experts. The self- 
confident forecaster has been introduced by Auer, Cesa-Bianchi, and Gentile [13], who 
also prove Theorem 11.5. 

The time varying potential for gradient-based forecasters of Section 11.6 was introduced 
and studied by Azoury and Warmuth [20], who also proved Theorem 11.6. The forecaster 
based on the elliptic potential, also analyzed in [20], corresponds to the well-known ridge 
regression algorithm (see Hoerl and Kennard [162]). An early analysis of the least-squares 
forecaster in the case where $y_t = u^\top x_t + \varepsilon_t$, where $u \in \mathbb{R}^d$ is an unknown target vector and
$\varepsilon_t$ are i.i.d. random variables with finite variance, is due to Lai, Robbins, and Wei [190].
Lemma 11.11 is due to Lai and Wei [191]. Theorem 11.7 is proven by Vovk in [300]. A
different proof of the same result is shown by Azoury and Warmuth in [20]. The proof
presented here uses ideas from Forster [102] and [20]. 

The derivation and analysis of the nonlinear forecaster in Section 11.8 is taken from 
Azoury and Warmuth [20], who introduced it as the “forward algorithm.” In [300], Vovk 
derives exactly the same forecaster generalizing to continuously many experts the aggre- 
gating forecaster described in Section 3.5, where the initial weights assigned to the linear 
experts are gaussian. Using different (and somewhat more complex) techniques, Vovk also
proves the same bound as the one proven in Theorem 11.8. The first logarithmic regret 
bound for the square loss with linear experts is due to Foster [103], who analyzes a variant 
of the ridge regression forecaster in the more specific setup where outcomes are binary, the 
side-information elements $x_t$ belong to $[0, 1]^d$, and the linear experts $u$ belong to the probability
simplex in $\mathbb{R}^d$. The lower bound in Section 11.9 on prediction with linear experts
and square loss is due to Vovk [300] (see also Singer, Kozat, and Feder [268] for a stronger 
result). 

Mixture forecasters in the spirit of Section 11.10 were considered by Vovk [299, 300]. 
Vovk’s aggregating forecaster is, in fact, a mixture forecaster and the Vovk—Azoury— 
Warmuth forecaster is obtained via a generalization of the aggregating forecaster. Theo- 
rem 11.10 was proved by Kakade and Ng [173]. A logarithmic regret bound may also be 
derived by a variation on a result of Yamanishi [314], who proved a general logarithmic 
bound for all mixable losses and for general parametric classes of experts using the aggre- 
gating forecaster. However, the resulting forecaster is not computationally efficient. It is 
worth pointing out that the mixture forecasters in Section 11.10 are formally equivalent 
to predictive bayesian mixtures. In fact, if go is the prior density, the mixture predictors 
are obtained by bayesian updating. Such predictors have been thoroughly studied in the 
bayesian literature under the assumption that the sequence of outcomes is generated by one 
of the models (see, e.g., Clarke and Barron [62,63]). The choice of prior has been studied 
in the individual sequence framework by Clarke and Dawid [64]. 


11.12 Exercises 


11.1 Prove Lemma 11.1. 
11.2 Prove that Lemma 11.3 holds with equality whenever S is a hyperplane. 


11.3 Consider the modified gradient-based linear forecaster whose weight $w_t$ at time $t$ is the solution
of the equation

$$ w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d} \bigl[D_{\Phi^*}(u, w_{t-1}) + \lambda\, \ell_t(u)\bigr]. $$

Note that this amounts to not taking the linear approximation of $\ell_t(u)$ around $w_{t-1}$, as done in
the original characterization of the gradient-based linear forecaster expressed as solution of a
convex optimization problem.

Prove that a solution to this equation always exists and then adapt the proof of Theorem 11.1
to prove a bound on the regret of the modified forecaster.


11.4 Prove that the weight update rule

$$ w_{i,t} = \frac{w_{i,t-1}\, e^{-\lambda \nabla\ell_t(w_{t-1})_i}}
                  {\sum_{j=1}^{d} w_{j,t-1}\, e^{-\lambda \nabla\ell_t(w_{t-1})_j}} $$

corresponds to a Bregman projection of $w'_{i,t} = w_{i,t-1}\, e^{-\lambda \nabla\ell_t(w_{t-1})_i}$, $i = 1, \ldots, d$, onto the
probability simplex in $\mathbb{R}^d$, where the projection is taken according to the Legendre dual

$$ \Phi^*(u) = \sum_{i=1}^{d} u_i (\ln u_i - 1) $$

of the potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$.

11.5 Consider the Legendre polynomial potential $\Phi_p$. Show that if the loss function is the square
loss $\ell(p, y) = \frac12 (p - y)^2$, then for all $w, u \in \mathbb{R}^d$, for all $y \in \mathbb{R}$, and for all $c > 0$,


2 
al -x, y)— Lu- x) < e (Do, U, w) — Da, (u, w')), 

where w = ®,(®,(w) — AV E(w) is the dual gradient update, and ņ = c/(a +c)Xp-— 
1) IxIiĝ) (Kivinen and Warmuth [181], Gentile [124].) Warning: This exercise is difficult. 
11.6 (Continued) Use the inequality stated in Exercise 11.5 to derive a bound on the square loss
regret $R_n(u)$ for the gradient-based linear forecaster using the identity function as transfer
function. Find a value of $c$ that yields a regret bound for the square loss slightly better than
that of Theorem 11.2.

11.7 Derive a regret bound for the gradient-based forecaster using the absolute loss $\ell(p, y) =
|p - y|$. Note that this loss is not regular as it is not differentiable at $p = y$ (Cesa-Bianchi [46],
Long [205]).

11.8 Prove a regret bound for the projected gradient-based linear forecaster using the hyperbolic
cosine potential.

11.9 Prove Lemma 11.8. Hint: Set $l_0 = a$ and prove, for each $t = 1, \ldots, n$, the inequality

11.10 Prove an analogue of Theorem 11.4 using the Legendre exponential potential and projecting
the weights to the convex set obtained by intersecting the probability simplex in $\mathbb{R}^d$ with the
hypercube $[\beta/d, 1]^d$, where $\beta$ is a free parameter (Herbster and Warmuth [160]).

11.11 Prove Lemma 11.9.

11.12 Show that

1. the nice pair of Example 11.6 is 2-subquadratic,
2. the nice pair of Example 11.7 is 1/4-subquadratic.


11.13 Consider the alternative definition of gradient-based forecaster using the time-varying potential

$$ w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d} \Phi_t^*(u), $$

where

$$ \Phi_t^*(u) = D_{\Phi_0^*}(u, w_0) + \sum_{s=1}^{t} \ell_s(u) $$

for some initial weight $w_0$ and Legendre potential $\Phi = \Phi_0$. Using this forecaster, show a
more general version of Theorem 11.6 proving the same bound without the requirement
$\nabla\Phi_t^*(w_t) = 0$ for all $t \ge 0$ (Azoury and Warmuth [20]).

11.14 Prove Lemma 11.10.

11.15 (Follow the best expert) Consider the online linear prediction problem with the square loss
$\ell(w_{t-1} \cdot x_t, y_t) = (w_{t-1} \cdot x_t - y_t)^2$. Consider the follow-the-best-expert (or least-squares) forecaster
with weights

$$ w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d} \sum_{s=1}^{t} (u \cdot x_s - y_s)^2 . $$

Show that

$$ w_t = \Bigl(\sum_{s=1}^{t} x_s x_s^\top\Bigr)^{-1} \sum_{s=1}^{t} y_s x_s $$

whenever $x_1 x_1^\top + \cdots + x_t x_t^\top$ is invertible. Use the analysis in Section 3.2 to derive logarithmic
regret bounds for this forecaster when $y_t \in [-1, 1]$. What conditions do you need for the $x_t$?

11.16 Provide a proof of Theorem 11.9 in the general case $d \ge 1$.

11.17 By adapting the proof of Theorem 11.9, show a lower bound for the relative entropy loss in
the univariate case $d = 1$ (Yamanishi [314]).

11.18 Consider the mixture forecaster $\hat p_t(\cdot, x_t)$ of Section 11.10 with the gaussian transfer function.
Assume that the gaussian initial density $q_0(u) = (2\pi)^{-d/2} e^{-\|u\|^2 / 2}$ is used. Assume that $\mathcal{Y} =
[-1, 1]$. Show that the forecaster $w_t$ defined by


_ BAH1,) -PU x) 
4 
is just the Vovk–Azoury–Warmuth forecaster (Vovk [300]).

11.19 Show that for any multivariate density $q$ on $\mathbb{R}^d$ with zero mean $\int u\, q(u)\, du = 0$ and covariance
matrix $K$, the differential entropy $h(q) = -\int q(u) \ln q(u)\, du$ satisfies

$$ h(q) \le \frac{d}{2} \ln(2\pi e) + \frac12 \ln\det(K), $$

and equality is achieved by the multivariate normal density with covariance matrix $K$ (see,
e.g., Cover and Thomas [74]).


11.20 Consider the mixture predictor in Section 11.10 and choose the initial density $q_0$ to be uniform
in a cube $[-B, B]^d$. Show that for any vector $u$ with $\|u\|_\infty \le B - (3d/nc)^{1/2}$, the regret
satisfies

$$ \hat L_n - L_n(u) \le \frac{d}{2} \ln\frac{e\, n c\, B^2}{3d} . $$

Hint: Choose the auxiliary density $q_u^\varepsilon$ to be uniform on a cube centered at $u$.


12 


Linear Classification 


12.1 The Zero—One Loss 


An important special case of linear prediction with side information (Chapter 11) is the
problem of binary pattern classification, where the decision space $\mathcal{D}$ and the outcome space
$\mathcal{Y}$ are both equal to $\{-1, 1\}$. To predict an outcome $y_t \in \{-1, 1\}$, given the side information
$x_t \in \mathbb{R}^d$, the forecaster uses the linear classification $\hat y_t = \operatorname{sgn}(w_{t-1} \cdot x_t)$, where $w_{t-1}$ is a
weight vector and $\operatorname{sgn}(\cdot)$ is the sign function. In the entire chapter we use the terminology
and notation introduced in Chapter 11.

A natural loss function in the framework of classification is the zero–one loss $\ell(\hat y, y) =
\mathbb{I}_{\{\hat y \ne y\}}$ counting the number of classification mistakes $\hat y \ne y$. Since this loss function is
not convex, we cannot analyze forecasters in this model using the machinery developed in
Chapter 11. A possibility, which we investigated in Chapter 4 for arbitrary losses, is to allow
the forecaster to randomize his predictions. As the expected zero–one loss is equivalent to
the absolute loss $\frac12 |y - p|$, where $y \in \{-1, 1\}$ and $p \in [-1, +1]$, we see that randomization
provides a convex variant of the original problem, which we can study using the techniques
of Chapter 11. In this chapter we show that, even in the case of deterministic predictions,
meaningful zero–one loss bounds can be derived by twisting the analysis for convex
losses.

Let $\hat p$ be a real-valued prediction used to determine the forecast $\hat y$ by $\hat y = \mathrm{sgn}(\hat p)$, and then consider a regular (and thus convex) loss function $\ell$ such that $\mathbb{I}_{\{\hat y \ne y\}} \le \ell(\hat p, y)$ for all $\hat p \in \mathbb{R}$ and $y \in \{-1, 1\}$. Now take any forecaster for linear experts, such as one of the gradient-based forecasters in Chapter 11. The techniques developed in Chapter 11 allow one to derive bounds for the regret $\hat L_n - L_n(u)$, where

$$\hat L_n = \sum_{t=1}^n \ell(\hat p_t, y_t) \qquad\text{and}\qquad L_n(u) = \sum_{t=1}^n \ell(u \cdot x_t, y_t).$$


Since $\mathbb{I}_{\{\hat y \ne y\}} \le \ell(\hat p, y)$, we immediately obtain a bound on the "regret"

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} - L_n(u)$$

of the forecaster using classifications of the form $\hat y_t = \mathrm{sgn}(\hat p_t)$. Note that this notion of regret evaluates the performance of the linear reference predictor $u$ with a loss function larger than the one used to evaluate the forecaster $\hat y$. This discrepancy is inherent to our analysis, which is largely based on convexity arguments.

Figure 12.1. A plot of the square loss $\ell(p, y) = (p - y)^2$ for $y = 1$. This loss upper bounds the zero-one loss. The "normalized" hinge loss $(1 - yp/\gamma)_+$ is another convex upper bound on the zero-one loss (see Section 12.2). The plot shows $(1 - p/\gamma)_+$ for $\gamma = 3/2$.

An example of the results we can obtain using such an argument is the following. Consider the square loss $\ell(\hat p, y) = (\hat p - y)^2$. This loss is regular and upper bounds the zero-one loss (see Figure 12.1). If we predict each binary outcome $y_t$ using $\hat y_t = \mathrm{sgn}(\hat p_t)$, where $\hat p_t$ is the Vovk-Azoury-Warmuth forecaster (see Section 11.8), then Theorem 11.8 immediately implies the bound

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \sum_{t=1}^n (u \cdot x_t - y_t)^2 + \|u\|^2 + \sum_{i=1}^d \ln(1 + \lambda_i),$$

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$. This bound holds for any sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$ and for all $u \in \mathbb{R}^d$.

In the next sections we illustrate a more sophisticated "variational" approach to the analysis of forecasters for linear classification. This approach is based on the idea of finding a parametric family of functions that upper bound the zero-one loss and then expressing the regret using the best of these functions to evaluate the performance of the reference forecaster.

The rest of this chapter is organized as follows. In Section 12.2 we introduce the hinge loss, which we use to derive mistake bounds for forecasters based on various potentials. In Section 12.3 we show that by modifying the forecaster based on the quadratic potential, we can obtain, after a finite number of weight updates, a maximum margin linear separator for any linearly separable data sequence. In Section 12.4 we extend the label efficient setup to linear classification and prove label efficient mistake bounds for several forecasters. Finally, Section 12.5 shows that some of the linear forecasters analyzed here can perform nonlinear classification, with a moderate computational overhead, by implicitly embedding the side information into a suitably chosen Hilbert space.


12.2 The Hinge Loss 


In general, the convex upper bound on the zero-one loss yielding the tightest approximation of the overall mistake count depends on the whole unknown sequence of side information and outcome pairs $(x_t, y_t)$, $t = 1, 2, \ldots$. In this section we introduce the hinge loss, a parameterized approximation to the zero-one loss, and we show simple forecasters that achieve a mistake bound that strikes an optimal tradeoff for the value of the parameter.

Following a consolidated terminology in learning theory, we call any side information $x$ an instance and any pair $(x, y)$ an example, where $y$ is the label associated with $x$.

The hinge loss, with hinge at $\gamma > 0$, is defined by

$$\ell_\gamma(p, y) = (\gamma - py)_+ \qquad \text{(the hinge loss)},$$

where $p \in \mathbb{R}$, $y \in \{-1, 1\}$, and $(x)_+$ denotes the positive part of $x$. Note that $\ell_\gamma(p, y)$ is convex in $p$ and $\ell_\gamma/\gamma$ is an upper bound on the zero-one loss (see Figure 12.1). Equipped with this notion of loss, we derive bounds on the regret

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} - \inf_{\gamma > 0} \frac{1}{\gamma} \sum_{t=1}^n \ell_\gamma(u \cdot x_t, y_t)$$

that hold for an arbitrarily chosen $u \in \mathbb{R}^d$.
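The relation between the two losses can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are hypothetical, and the convention sgn(0) = +1 is an assumption made here only to resolve ties.

```python
def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def zero_one_loss(p, y):
    # 1 if the sign of the real-valued prediction p disagrees with the label y.
    return 1 if sgn(p) != y else 0


def hinge_loss(p, y, gamma=1.0):
    # Hinge loss with hinge at gamma: the positive part of (gamma - p * y).
    return max(0.0, gamma - p * y)


def normalized_hinge_loss(p, y, gamma=1.0):
    # The hinge loss divided by gamma, which upper bounds the zero-one loss.
    return hinge_loss(p, y, gamma) / gamma
```

For instance, `normalized_hinge_loss(-0.2, 1, 1.5)` exceeds 1, in agreement with the fact that a prediction of the wrong sign always costs at least one unit of normalized hinge loss.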

We develop forecasters for classification that adopt a conservative updating policy. This means that the current weight vector $w_{t-1}$ is updated only when $\hat y_t \ne y_t$. So, the prediction of a conservative forecaster at time $t$ only depends on the past examples $(x_s, y_s)$, for $s < t$, such that $\hat y_s \ne y_s$. The philosophy behind this conservative policy is that there is no reason to change the weight vector if it has worked well at the last time instance. We focus on conservative gradient-based forecasters whose update is based on the hinge loss. Recall from Section 11.3 that the gradient-based linear forecaster computes predictions $\hat p_t = w_{t-1} \cdot x_t$, where the weights $w_{t-1}$ are updated using the dual gradient update

$$\nabla\Phi^*(w_t) = \nabla\Phi^*(w_{t-1}) - \lambda \nabla\ell_{\gamma,t}(w_{t-1}),$$

where $\nabla\ell_{\gamma,t}(w) = \nabla\ell_\gamma(w \cdot x_t, y_t)$. Technically, $\ell_\gamma(p, y)$ is not differentiable at $p = \gamma/y$. However, since our algorithms are conservative, the derivative of $\ell_\gamma(\cdot, y)$ is computed only when $p$ and $y$ have different signs. Thus, we may set $\nabla\ell_{\gamma,t}(w) = -y_t x_t \mathbb{I}_{\{\hat y \ne y_t\}}$, where $\hat y = \mathrm{sgn}(w \cdot x_t)$.

The conservative gradient-based forecaster for classification is spelled out here for a Legendre potential $\Phi$. For the definition of a Legendre potential function $\Phi$ and its dual $\Phi^*$ see Section 11.2.

THE CONSERVATIVE FORECASTER
FOR LINEAR CLASSIFICATION

Parameters: learning rate $\lambda > 0$, Legendre potential $\Phi$.
Initialization: $w_0 = \nabla\Phi(0)$.
For each round $t = 1, 2, \ldots$

(1) observe $x_t$, set $\hat p_t = w_{t-1} \cdot x_t$, and predict $\hat y_t = \mathrm{sgn}(\hat p_t)$;

(2) get $y_t \in \{-1, 1\}$;

(3) if $\hat y_t \ne y_t$, then let $w_t = \nabla\Phi(\nabla\Phi^*(w_{t-1}) + \lambda y_t x_t)$;
else let $w_t = w_{t-1}$.
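The steps in the box above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the potential is passed in through two hypothetical callables `grad_phi` and `grad_phi_star` (the gradients of $\Phi$ and $\Phi^*$), the default identity maps correspond to the quadratic potential, and sgn(0) = +1 is an assumed convention.

```python
def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def identity(v):
    # Gradient of the quadratic potential (1/2)||v||^2, which is its own dual.
    return list(v)


def conservative_forecaster(examples, grad_phi=identity,
                            grad_phi_star=identity, lam=1.0):
    """Run the conservative forecaster on a list of (x, y) pairs.

    Returns the final weight vector and the number of mistakes.
    """
    d = len(examples[0][0])
    w = grad_phi([0.0] * d)            # initialization w_0 = grad Phi(0)
    mistakes = 0
    for x, y in examples:
        y_hat = sgn(dot(w, x))         # step (1): predict with the current weights
        if y_hat != y:                 # step (3): update only after a mistake
            mistakes += 1
            theta = grad_phi_star(w)   # map the weights to the dual space
            theta = [th + lam * y * xi for th, xi in zip(theta, x)]
            w = grad_phi(theta)        # map back to the primal space
    return w, mistakes
```

With the default arguments the update is $w_t = w_{t-1} + \lambda y_t x_t$ on mistaken rounds, so running `conservative_forecaster(examples)` on a list of `(x, y)` pairs behaves like a Perceptron with learning rate `lam`.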


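A direct application of the results from Chapter 11 would not serve to prove regret bounds where $\gamma$ is set optimally. We take a different route that can be followed in the case of any gradient-based forecaster using the hinge loss with a conservative updating policy. A basic inequality for such forecasters, shown in the proof of Theorem 11.1 using Taylor's theorem, is

$$\lambda\bigl(\ell_{\gamma,t}(w_{t-1}) - \ell_{\gamma,t}(u)\bigr) \le \lambda (u - w_{t-1}) \cdot \bigl(-\nabla\ell_{\gamma,t}(w_{t-1})\bigr) = D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t)$$

for any $u \in \mathbb{R}^d$ and any Legendre potential $\Phi$. Now observe that, at any step $t$ such that $\mathrm{sgn}(\hat p_t) \ne y_t$, the hinge loss $\ell_{\gamma,t}(u) = (\gamma - y_t\, u \cdot x_t)_+$ obeys the inequality

$$\gamma - \ell_{\gamma,t}(u) = \gamma - (\gamma - y_t\, u \cdot x_t)_+ \le y_t\, u \cdot x_t = u \cdot \bigl(-\nabla\ell_{\gamma,t}(w_{t-1})\bigr).$$

Therefore,

$$\lambda\bigl(\gamma - \ell_{\gamma,t}(u)\bigr)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \lambda\, u \cdot \bigl(-\nabla\ell_{\gamma,t}(w_{t-1})\bigr) \le \lambda (u - w_{t-1}) \cdot \bigl(-\nabla\ell_{\gamma,t}(w_{t-1})\bigr) = D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t).$$

To understand the second inequality note that $\nabla\ell_{\gamma,t} \ne 0$ only when $\mathrm{sgn}(\hat p_t) \ne y_t$ and, in this case, $w_{t-1} \cdot \nabla\ell_{\gamma,t}(w_{t-1}) = -y_t\, w_{t-1} \cdot x_t \ge 0$. The equality holds when $\nabla\ell_{\gamma,t} = 0$ because, in that case, $w_{t-1} = w_t$.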
The advantage of this approach is seen as follows: if we multiply both sides of

$$\lambda\bigl(\gamma - \ell_{\gamma,t}(u)\bigr)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \lambda\, u \cdot \bigl(-\nabla\ell_{\gamma,t}(w_{t-1})\bigr)$$

by an arbitrary extra parameter $\alpha > 0$, and proceed as above, we obtain the new inequality

$$\alpha\lambda\bigl(\gamma - \ell_{\gamma,t}(u)\bigr)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le D_{\Phi^*}(\alpha u, w_{t-1}) - D_{\Phi^*}(\alpha u, w_t) + D_{\Phi^*}(w_{t-1}, w_t).$$

Note that on the right-hand side $u$ is now scaled by $\alpha$. Summing for $t = 1, \ldots, n$ and using some simple algebraic manipulation, we obtain

$$\sum_{t=1}^n \bigl(\alpha\lambda\gamma - D_{\Phi^*}(w_{t-1}, w_t)\bigr)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \alpha\lambda L_{\gamma,n}(u) + D_{\Phi^*}(\alpha u, w_0). \qquad (12.1)$$

This is our basic inequality for the hinge loss, and we now apply it to different potential functions.

Polynomial Potential and the Perceptron

Consider first the conservative forecaster for classification based on the Legendre polynomial potential $\Phi_p(u) = \frac{1}{2}\|u\|_p^2$, where $p \ge 2$. The conservative update rule for this forecaster can be written as

$$w_t = \nabla\Phi_p\bigl(\nabla\Phi_p^*(w_{t-1}) + \lambda y_t x_t \mathbb{I}_{\{\hat y_t \ne y_t\}}\bigr).$$

It turns out that for the polynomial potential linear classification is, in some sense, easier than the linear pattern recognition problem of Chapter 11. In particular, a constant learning rate ($\lambda = 1$) is sufficient to obtain a good bound for the number of mistakes. Unfortunately, as we will see, this is not the case for the exponential potential, which still needs a careful choice of the learning rate.


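Theorem 12.1. If the conservative forecaster using the Legendre polynomial potential $\Phi_p$ is run on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$ with learning rate $\lambda = 1$, then for all $n \ge 1$, for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$,

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + (p - 1)\left(\frac{X_p}{\gamma}\|u\|_q\right)^2 + \sqrt{(p - 1)\left(\frac{X_p}{\gamma}\|u\|_q\right)^2 \frac{L_{\gamma,n}(u)}{\gamma}},$$

where $X_p = \max_{t=1,\ldots,n} \|x_t\|_p$ and $q = p/(p - 1)$ is the conjugate exponent of $p$.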


Note that this bound holds simultaneously for all $u \in \mathbb{R}^d$ and for all $\gamma > 0$. Hence, in particular, it holds for the best possible $\gamma$ for each linear classifier $u$. Note also that this bound has the same general form as the bound stated in Theorem 11.2, in which $\lambda$ is set optimally for each choice of $\gamma$ and $u$. So, linear classification does not require the self-confident tuning techniques that we used in Section 11.5.


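Proof. We start from inequality (12.1). Following the proof of Theorem 11.2, for any $t$ with $\hat y_t \ne y_t$, we upper bound $D_{\Phi^*}(w_{t-1}, w_t)$ as follows:

$$D_{\Phi^*}(w_{t-1}, w_t) \le \frac{p - 1}{2}\,\bigl\|\nabla\ell_{\gamma,t}(w_{t-1})\bigr\|_p^2 \le \frac{p - 1}{2} X_p^2.$$

Now apply this bound to (12.1) for $u \in \mathbb{R}^d$ arbitrary and with $\lambda = 1$. This yields

$$\sum_{t=1}^n \left(\alpha\gamma - \frac{p - 1}{2} X_p^2\right)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \alpha L_{\gamma,n}(u) + \frac{\alpha^2}{2}\|u\|_q^2,$$

where we used the equality $D_{\Phi^*}(\alpha u, w_0) = \frac{\alpha^2}{2}\|u\|_q^2$. Setting $\alpha = \bigl(2\varepsilon + (p - 1)X_p^2\bigr)/(2\gamma)$ for $\varepsilon > 0$ (to be determined later), and dividing by $\varepsilon > 0$, gives

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{(p - 1)X_p^2}{2\varepsilon}\,\frac{L_{\gamma,n}(u)}{\gamma} + \frac{\bigl(2\varepsilon + (p - 1)X_p^2\bigr)^2}{4\varepsilon\gamma^2}\,\frac{\|u\|_q^2}{2}.$$

To minimize the bound, set

$$\varepsilon = \frac{X_p}{\|u\|_q}\sqrt{(p - 1)\,\gamma\,L_{\gamma,n}(u) + \left(\frac{p - 1}{2} X_p \|u\|_q\right)^2}.$$

With an easy algebraic manipulation we then get

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{p - 1}{2}\left(\frac{X_p}{\gamma}\|u\|_q\right)^2 + \sqrt{(p - 1)\left(\frac{X_p}{\gamma}\|u\|_q\right)^2\frac{L_{\gamma,n}(u)}{\gamma} + \left(\frac{p - 1}{2}\left(\frac{X_p}{\gamma}\|u\|_q\right)^2\right)^2}.$$

Using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for all $a, b \ge 0$ we get the inequality stated in the theorem. Since $u$ and $\gamma > 0$ were arbitrary, the proof is concluded. ∎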


The forecaster of Theorem 12.1 is also known as the p-norm Perceptron algorithm. For $p = 2$, this reduces to the classical Perceptron algorithm, whose weight updating rule is simply $w_t = w_{t-1} + y_t x_t \mathbb{I}_{\{\hat y_t \ne y_t\}}$. Note also that, for $p = 2$ and $L_{\gamma,n}(u) = 0$, the bound of Theorem 12.1 reduces to

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \left(\frac{\max_{t=1,\ldots,n}\|x_t\|\,\|u\|}{\gamma}\right)^2.$$

In this special case, the result is equivalent to the Perceptron convergence theorem (see Section 12.6). More precisely, $L_{\gamma,n}(u) = 0$ implies that the sequence $(x_1, y_1), \ldots, (x_n, y_n)$ is linearly separated by the hyperplane $u$ with margin at least $\gamma$. The margin of the linearly separable data sequence with respect to the separating hyperplane $u$ is defined by $\min_t y_t\, u \cdot x_t / \|u\|$. The Perceptron convergence theorem states that the number of mistakes (or, equivalently, updates) performed by the Perceptron algorithm on any linearly separable sequence is at most the squared ratio of (1) the radius of the smallest origin-centered euclidean ball enclosing all instances and (2) the margin $\gamma$ of any separating hyperplane $u$ (see Figure 12.2).
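As an illustration, the p-norm Perceptron can be sketched as follows. This is a hedged sketch, not the authors' code: it assumes the standard closed form for the gradient of $\frac{1}{2}\|v\|_p^2$, keeps the dual vector `theta` explicitly instead of recomputing $\nabla\Phi_p^*(w_{t-1})$ at every round (the two are equivalent), and uses the assumed convention sgn(0) = +1. All function names are hypothetical.

```python
import math


def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_half_squared_norm(v, p):
    """Gradient of (1/2)*||v||_p^2; the zero vector is mapped to itself."""
    norm = sum(abs(vi) ** p for vi in v) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(vi) ** (p - 1), vi) / norm ** (p - 2) for vi in v]


def p_norm_perceptron(examples, p=2.0, lam=1.0):
    """Conservative forecaster with polynomial potential Phi_p."""
    d = len(examples[0][0])
    theta = [0.0] * d                      # dual weights
    w = grad_half_squared_norm(theta, p)   # primal weights w = grad Phi_p(theta)
    mistakes = 0
    for x, y in examples:
        if sgn(dot(w, x)) != y:            # conservative: update on mistakes only
            mistakes += 1
            theta = [th + lam * y * xi for th, xi in zip(theta, x)]
            w = grad_half_squared_norm(theta, p)
    return w, mistakes
```

For `p=2.0` the link function is the identity and the sketch coincides with the classical Perceptron; larger values of `p` change how the accumulated updates are mapped into weights.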

The dynamic tuning $\lambda_t = 1/\|x_t\|$ is known to improve the empirical performance of the Perceptron algorithm. Indeed, we can easily extend the analysis of Theorem 12.1 (see Exercise 12.1) and prove a bound that, in the case of linearly separable sequences, can be stated as

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \left(\max_{t=1,\ldots,n}\frac{\|x_t\|\,\|u\|}{y_t\, u \cdot x_t}\right)^2,$$

where $u$ is any linear separator. The improvement is clear when we rewrite the bound for the Perceptron with static tuning $\lambda = 1$ as

$$\left(\frac{\max_{t=1,\ldots,n}\|x_t\|\,\|u\|}{\min_{t=1,\ldots,n}\, y_t\, u \cdot x_t}\right)^2.$$

Figure 12.2. The instances $x_t \in \mathbb{R}^2$ of a linearly separable sequence $(x_1, y_1), \ldots, (x_n, y_n)$. Empty circles denote instances $x_t$ with label $y_t = 1$ and filled circles denote instances with label $y_t = -1$. A separating hyperplane, passing through the origin, with margin $\gamma$ is drawn. The Perceptron convergence theorem states that, on this sequence, the Perceptron algorithm will make at most $(R/\gamma)^2$ mistakes or, equivalently, perform at most $(R/\gamma)^2$ updates.


Exponential Potential and Winnow 

We proceed by considering the conservative linear forecaster based on the Legendre exponential potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$. Unlike the analysis of the polynomial potential, here the choice of the learning rate makes a difference. Indeed, our result does not address the issue of tuning $\lambda$, and the regret bound we prove retains the same general form as the bound proven in Theorem 11.3 for the linear pattern recognition problem.


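Theorem 12.2. Assume the conservative forecaster using the Legendre exponential potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$ with normalized weights is run with learning rate $\lambda = (2\gamma\varepsilon)/X_\infty^2$, where $0 < \varepsilon < 1$, on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$. Then for all $n \ge 1$ such that $X_\infty \ge \max_{t=1,\ldots,n}\|x_t\|_\infty$ and for all $u \in \mathbb{R}^d$ in the probability simplex,

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{1}{1 - \varepsilon}\,\frac{L_{\gamma,n}(u)}{\gamma} + \left(\frac{X_\infty}{\gamma}\right)^2 \frac{\ln d}{2\varepsilon(1 - \varepsilon)}.$$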


Proof. Following the proof of Theorem 11.3, we upper bound $D_{\Phi^*}(w_{t-1}, w_t)$ as follows:

$$D_{\Phi^*}(w_{t-1}, w_t) \le D_{\Phi^*}(w_{t-1}, w'_t) \le \frac{\lambda^2}{2} X_\infty^2,$$

where $w'_{i,t} = w_{i,t-1}\, e^{-\lambda \nabla\ell_{\gamma,t}(w_{t-1})_i}$ and $w_t$ is $w'_t$ normalized (as in the proof of Theorem 11.3, we applied the generalized pythagorean inequality, Lemma 11.3, in the second step of this derivation). Now apply this bound to (12.1) for $\alpha = 1$ and for any $u \in \mathbb{R}^d$ in the probability simplex. This yields

$$\sum_{t=1}^n \left(\lambda\gamma - \frac{\lambda^2}{2} X_\infty^2\right)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \lambda L_{\gamma,n}(u) + \ln d,$$

where we used the inequality $D_{\Phi^*}(u, w_0) \le \ln d$ for $w_0 = (1/d, \ldots, 1/d)$. Dividing both sides by $\lambda\gamma > 0$ and substituting our choice of $\lambda$ yields the desired bound. ∎


The forecaster used in Theorem 12.2 is a "zero-threshold" variant of the Winnow algorithm. In the linearly separable case, the bound of Theorem 12.2, with $\varepsilon = 1/2$, reduces to

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le 2\left(\frac{\max_t \|x_t\|_\infty}{\gamma}\right)^2 \ln d,$$

where $\gamma$ is the margin of the hyperplane defined by $u$. A comparison with the corresponding bound for the Perceptron algorithm reveals an interplay between polynomial and exponential potential analogous to that discussed in Section 11.4.
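A minimal sketch of the conservative forecaster with normalized exponential weights follows. It is an illustration under assumptions, not the authors' code: the multiplicative form of the update is the one obtained from the exponential potential with normalization, the weights start at the uniform vector, sgn(0) = +1 is an assumed convention, and the function name is hypothetical.

```python
import math


def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalized_exponential_forecaster(examples, lam):
    """Conservative multiplicative updates with weights kept in the simplex."""
    d = len(examples[0][0])
    w = [1.0 / d] * d                      # w_0 = (1/d, ..., 1/d)
    mistakes = 0
    for x, y in examples:
        if sgn(dot(w, x)) != y:            # update only after a mistake
            mistakes += 1
            # Multiplicative step: the negative hinge gradient is y * x.
            w_prime = [wi * math.exp(lam * y * xi) for wi, xi in zip(w, x)]
            total = sum(w_prime)
            w = [wi / total for wi in w_prime]   # normalize back to the simplex
    return w, mistakes
```

With $\gamma$, $\varepsilon$, and $X_\infty$ known, the learning rate of Theorem 12.2 would be passed as `lam = 2 * gamma * eps / x_inf ** 2`; in practice these quantities are usually unknown, which is the tuning issue the text points out.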


The Second-Order Perceptron 

We close this section by showing an analysis of the conservative classifier based on the Vovk-Azoury-Warmuth forecaster (see the discussion at the end of Section 12.1). Following the terminology introduced by Cesa-Bianchi, Conconi, and Gentile [47], we call this classifier second-order Perceptron. At each time step $t = 1, 2, \ldots$ the second-order Perceptron predicts with $\hat y_t = \mathrm{sgn}(\hat w_t^\top x_t)$, where

$$\hat w_t = A_t^{-1} \sum_{s=1}^{t-1} y_s x_s \mathbb{I}_{\{\hat y_s \ne y_s\}} \qquad\text{and}\qquad A_t = \left(aI + \sum_{s=1}^{t-1} x_s x_s^\top \mathbb{I}_{\{\hat y_s \ne y_s\}} + x_t x_t^\top\right)$$

(following the notation introduced in Chapter 11, we use $\hat w_t$ instead of $\hat w_{t-1}$ to denote the weight vector of this forecaster at time $t$). Note that we have introduced a parameter $a > 0$ multiplying the identity matrix $I$. Even though this parameter is not used in the analysis of the second-order Perceptron, it becomes convenient when we compare the behavior of this algorithm with that of the standard Perceptron.
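The prediction rule above can be sketched directly from its definition. The code below is illustrative only: it solves the linear system $A_t \hat w_t = \sum_s y_s x_s \mathbb{I}_{\{\hat y_s \ne y_s\}}$ from scratch at every round with a small hand-written Gaussian elimination (a hypothetical helper named `solve`), which is simple but not efficient, and it assumes sgn(0) = +1.

```python
def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a_matrix, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a_matrix, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def second_order_perceptron(examples, a=1.0):
    """Return (mistakes, matrix aI + sum of x x^T over mistakes, vector v)."""
    d = len(examples[0][0])
    s_mat = [[a if i == j else 0.0 for j in range(d)] for i in range(d)]
    v = [0.0] * d                      # sum of y_s * x_s over mistaken rounds
    mistakes = 0
    for x, y in examples:
        # A_t also includes the current instance x_t.
        a_t = [[s_mat[i][j] + x[i] * x[j] for j in range(d)] for i in range(d)]
        w_hat = solve(a_t, v)
        if sgn(dot(w_hat, x)) != y:    # conservative: store x_t only on mistakes
            mistakes += 1
            s_mat = a_t
            v = [vi + y * xi for vi, xi in zip(v, x)]
    return mistakes, s_mat, v
```

A practical implementation would more likely maintain the inverse matrix incrementally with rank-one updates; the direct solve is used here only to keep the sketch close to the formulas.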

Rather than using the inequality (12.1), we follow the arguments developed in 
Section 11.8 for the analysis of the Vovk-Azoury-Warmuth forecaster. 


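Theorem 12.3. If the second-order Perceptron (the conservative Vovk-Azoury-Warmuth forecaster) is run on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$, then for all $n \ge 1$, for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$,

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{1}{\gamma}\sqrt{\bigl(a\|u\|^2 + u^\top A_n u\bigr)\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right)},$$

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix

$$A_n = \sum_{t=1}^n x_t x_t^\top \mathbb{I}_{\{\hat y_t \ne y_t\}}.$$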

The bound stated by Theorem 12.3 is not in closed form as the terms $\mathbb{I}_{\{\hat y_t \ne y_t\}}$ appear on both sides. We could obtain a closed form by replacing $A_n$ with the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$ including all instances. Note that this substitution can only increase the eigenvalues $\lambda_1, \ldots, \lambda_d$. However, the closed-form bound is probably too weak to reveal the improvements brought about by the second-order Perceptron analysis with respect to the classical Perceptron (see the end of this subsection for a detailed comparison of the two bounds in a special case).


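Proof of Theorem 12.3. We follow the notation of Section 11.8 with the necessary adjustments because we have introduced the new parameter $a$ and we are considering conservative forecasters. So, in particular, define

$$\Phi_t(u) = \frac{a}{2}\|u\|^2 + \frac{1}{2}\sum_{s=1}^{t-1}\bigl(u^\top x_s - y_s\bigr)^2 \mathbb{I}_{\{\hat y_s \ne y_s\}}.$$

This potential is connected to $\hat w_t$ by the following relation (see the proof of Theorem 11.8):

$$\frac{1}{2}\bigl(\hat w_t^\top x_t - y_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}} = \inf_v \Phi_{t+1}(v) - \inf_v \Phi_t(v) + \frac{1}{2}\, x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}} - \frac{1}{2}\bigl(x_t^\top A_{t-1}^{-1} x_t\bigr)\bigl(\hat w_t^\top x_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}},$$

where, if $\hat y_t = y_t$, the equality holds because $\inf_v \Phi_{t+1}(v) = \inf_v \Phi_t(v)$. We drop the last term, which is negative because $A_{t-1}$ is positive definite, and sum over $t = 1, \ldots, n$, obtaining, for any $u \in \mathbb{R}^d$,

$$\frac{1}{2}\sum_{t=1}^n \bigl(\hat w_t^\top x_t - y_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \inf_v \Phi_{n+1}(v) - \inf_v \Phi_1(v) + \frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}}$$

$$\le \Phi_{n+1}(u) + \frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}}$$

$$= \frac{a}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^n \bigl(u^\top x_t - y_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}} + \frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}},$$

where we used $\inf_v \Phi_1(v) = 0$. Expanding the squares and performing trivial simplifications, we get to the following inequality:

$$\frac{1}{2}\sum_{t=1}^n \Bigl(\bigl(\hat w_t^\top x_t\bigr)^2 - 2 y_t\, \hat w_t^\top x_t\Bigr)\mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{1}{2}\left[a\|u\|^2 + \sum_{t=1}^n \bigl(u^\top x_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}}\right] - \sum_{t=1}^n y_t\, u^\top x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}} + \frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}}.$$

Note that the left-hand side of this inequality is a sum of positive terms, because $-2 y_t\, \hat w_t^\top x_t \ge 0$ whenever $\hat y_t \ne y_t$. In addition, we can write

$$\frac{1}{2}\left[a\|u\|^2 + \sum_{t=1}^n \bigl(u^\top x_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}}\right] = \frac{1}{2}\, u^\top\left(aI + \sum_{t=1}^n x_t x_t^\top \mathbb{I}_{\{\hat y_t \ne y_t\}}\right) u = \frac{1}{2}\, u^\top (aI + A_n)\, u$$

and, using Lemma 11.11,

$$\frac{1}{2}\sum_{t=1}^n x_t^\top A_t^{-1} x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{1}{2}\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right).$$

This allows us to write the simpler form

$$0 \le \frac{1}{2}\, u^\top (aI + A_n)\, u - \sum_{t=1}^n y_t\, u^\top x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}} + \frac{1}{2}\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right).$$

Since $u$ was chosen arbitrarily, this inequality also holds when $u$ is replaced by $\alpha u$, where $\alpha > 0$ is a free parameter. Performing this substitution, we end up with

$$0 \le \frac{\alpha^2}{2}\, u^\top (aI + A_n)\, u - \alpha\sum_{t=1}^n y_t\, u^\top x_t\, \mathbb{I}_{\{\hat y_t \ne y_t\}} + \frac{1}{2}\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right).$$

To introduce hinge loss terms, observe that $-y_t\, u^\top x_t \le \ell_{\gamma,t}(u) - \gamma$ for all $\gamma > 0$. Substituting this into the above inequality, rearranging, and dividing both sides by $\alpha\gamma > 0$ yields

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{\alpha}{2\gamma}\, u^\top (aI + A_n)\, u + \frac{1}{2\alpha\gamma}\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right).$$

Substituting the choice

$$\alpha = \sqrt{\frac{\sum_{i=1}^d \ln(1 + \lambda_i/a)}{u^\top (aI + A_n)\, u}}$$

implies the claimed bound. ∎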


We now compare the bound of Theorem 12.3 with the corresponding bound for the 
Perceptron algorithm (Theorem 12.1 with p = 2). 


Consider the simple case of a sequence $(x_1, y_1), \ldots, (x_n, y_n)$ where $\|x_t\| = 1$ for all $t$ and such that there exists some $u \in \mathbb{R}^d$, with $\|u\| = 1$, satisfying $y_t\, u^\top x_t \ge \gamma > 0$ for all $t = 1, \ldots, n$. For this linearly separable sequence, the bound of Theorem 12.3 may then be written as

$$\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne y_t\}} \le \frac{1}{\gamma}\sqrt{\bigl(a + u^\top A_n u\bigr)\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right)}. \qquad (12.2)$$

Recall that this bound is not in closed form as $A_n$ (and its eigenvalues) depend on the mistake terms $\mathbb{I}_{\{\hat y_t \ne y_t\}}$. Then let $m$ be the largest cardinality of a subset $M \subseteq \{1, \ldots, n\}$ such that (12.2) is still satisfied when a mistake is made on each $t \in M$. Moreover, let $m' = 1/\gamma^2$, where $1/\gamma^2$ is the Perceptron bound specialized to the case $\|u\| = 1$ and $\|x_t\| = 1$ for all $t$. We want to investigate conditions on the sequence guaranteeing that $m \le m'$. To do that, we represent $m'$ as the unique positive solution of

$$m' = \frac{1}{\gamma}\sqrt{m'}.$$

Thus $m \le m'$ whenever

$$\bigl(a + u^\top A_n u\bigr)\sum_{i=1}^d \ln\left(1 + \frac{\lambda_i}{a}\right) \le m. \qquad (12.3)$$

Now note that, since $\|u\| = 1$ and $\|x_t\| = 1$,

$$u^\top A_n u = \sum_{t=1}^n \bigl(u^\top x_t\bigr)^2 \mathbb{I}_{\{\hat y_t \ne y_t\}} \le m.$$

Using $(u^\top x_t)^2 = (y_t\, u^\top x_t)^2 \ge \gamma^2$, we get $\gamma^2 m \le u^\top A_n u \le m$ for all $u \in \mathbb{R}^d$. Hence, we may write $u^\top A_n u = \alpha m$ for some $\gamma^2 \le \alpha \le 1$. Since $\|x_t\| = 1$ also implies $\lambda_1 + \cdots + \lambda_d = m$, we may set $\lambda_i = \alpha_i m$, where the coefficients $\alpha_1, \ldots, \alpha_d \ge 0$ are such that $\alpha_1 + \cdots + \alpha_d = 1$. Performing these substitutions in (12.3) we obtain

$$(a + \alpha m)\sum_{i=1}^d \ln\left(1 + \frac{\alpha_i m}{a}\right) \le m. \qquad (12.4)$$

If $\alpha m = u^\top A_n u$ is small compared with the large eigenvalues of $A_n$, then there exist choices of $a$ that satisfy (12.4) (see Exercise 12.5).

This discussion suggests that the linearly separable sequences on which the second-order Perceptron has an advantage over the classical Perceptron are those where linear separators $u$ tend to be nearly orthogonal to the eigenvectors of $A_n$ with large eigenvalues. In such sequences, a large share of instances $x_t$ must thus have the property that $y_t\, u^\top x_t$ is close to the minimum value $\gamma$.

As a final remark note that the bounds of Theorems 12.1 and 12.3 are invariant to simultaneous rescalings of $\gamma$ and $\|u\|$ that do not change the ratio $\|u\|/\gamma$ (in Theorem 12.1 this ratio should take the form $\|u\|_q/\gamma$). This is what we expect, because the loss $L_{\gamma,n}(u)/\gamma$ exhibits the same kind of invariance. Hence, we do not lose any generality if these results are stated with $\gamma$ set to 1.


12.3 Maximum Margin Classifiers


In this section we study the scenario in which a forecaster is repeatedly run on the same sequence of examples. More specifically, we say that a forecaster is cyclically run on a "base sequence" $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ if it is run on the sequence $(x'_1, y'_1), (x'_2, y'_2), \ldots$, where $(x'_{t+kn}, y'_{t+kn}) = (x_t, y_t)$ for all $k \ge 0$ and $t = 1, \ldots, n$.

If the base sequence is linearly separable, then the mistake bound for a conservative forecaster tells us how many updates are performed at most before the forecaster's current classifier converges to a linear separator of the base sequence. For example, the Perceptron convergence theorem (see Section 12.2) states that at most $(\max_t \|x_t\|/\gamma)^2$ updates are needed to find a linear separator for any sequence linearly separable with margin $\gamma > 0$. However, the results of Section 12.2 do not provide information on the margin of the separator found by the forecaster.

The question addressed here is whether we can modify the forecasters of Section 12.2 so that the classifier obtained after the last update has a margin close to the largest margin achievable by any linear separator of the sequence.

Assume that the sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$ is linearly separable by $u \in \mathbb{R}^d$ and that $\|x_t\| = 1$ for all $t$. We now show that the following algorithm, a simple modification of the Perceptron, when cyclically run on a sequence with margin $\gamma$, finds a linear separator with margin $(1 - \alpha)\gamma$ after a number of updates of order $1/(\alpha\gamma)^2$, where $\alpha$ is an input parameter. Following the terminology of Gentile [123], we call this modified Perceptron ALMA (approximate large margin algorithm).


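THE ALMA FORECASTER
Parameter: $\alpha \in (0, 1]$.
Initialization: $w_0 = (0, \ldots, 0)$, $k = 1$.
For each round $t = 1, 2, \ldots$
(1) $\gamma_t = \sqrt{8/k}\,/\alpha$;
(2) observe $x_t$, set $\hat p_t = w_{t-1} \cdot x_t$, and predict with $\hat y_t = \mathrm{sgn}(\hat p_t)$;
(3) get label $y_t \in \{-1, 1\}$;
(4) if $y_t\, w_{t-1} \cdot x_t \le (1 - \alpha)\gamma_t$, then
(4.1) $\eta_t = \sqrt{2/k}$ and $w'_t = w_{t-1} + \eta_t y_t x_t$;
(4.2) $w_t = w'_t / \|w'_t\|$;
(4.3) $k \leftarrow k + 1$;
(5) else, let $w_t = w_{t-1}$.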


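Theorem 12.4. Suppose the ALMA forecaster is cyclically run on a sequence $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ with $\|x_t\| = 1$ for all $t$, linearly separable by $u \in \mathbb{R}^d$ with margin $\gamma > 0$. Let $m$ be the number of updates performed by ALMA on this sequence. Then $m$ is finite and satisfies

$$m \le \frac{2}{\gamma^2}\left(\frac{2}{\alpha} - 1\right)^2 + \frac{8}{\alpha} - 4.$$

In particular, the number of mistakes $\sum_{t \ge 1} \mathbb{I}_{\{\hat y_t \ne y_t\}}$ is also finite, because every mistake triggers an update. Furthermore, let $s$ be the time step when the last update occurs. Then the weight $w_s$ computed by ALMA at time $s$ is a linear separator of the sequence achieving margin $(1 - \alpha)\gamma$.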
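A cyclic run of ALMA can be sketched as follows. This is a hedged illustration of the algorithm box, not Gentile's or the authors' implementation: the stopping rule (one full pass over the base sequence without updates) and the `max_cycles` safeguard are assumptions added here, and the sketch does not guard against the degenerate case in which the unnormalized vector is exactly zero.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def alma(base, alpha, max_cycles=1000):
    """Cyclically run ALMA on unit-norm instances; return (w, updates)."""
    d = len(base[0][0])
    w = [0.0] * d
    k = 1
    updates = 0
    for _ in range(max_cycles):            # hypothetical cap on the passes
        updated = False
        for x, y in base:
            gamma_t = math.sqrt(8.0 / k) / alpha
            # Update whenever the current margin is not large enough.
            if y * dot(w, x) <= (1.0 - alpha) * gamma_t:
                eta = math.sqrt(2.0 / k)
                w_prime = [wi + eta * y * xi for wi, xi in zip(w, x)]
                norm = math.sqrt(dot(w_prime, w_prime))
                w = [wi / norm for wi in w_prime]   # renormalize to unit length
                k += 1
                updates += 1
                updated = True
        if not updated:                    # a full pass without updates: stop
            break
    return w, updates
```

Smaller values of `alpha` ask for a margin closer to the best achievable one, at the price of a larger bound on the number of updates.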


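Proof. For any $t = 1, 2, \ldots$, let $N_t = \|w'_t\|$ and $\gamma_t(u) = y_t\, u \cdot x_t$. We first find an upper bound on $m$ by studying the quantity $u \cdot w_t$. Choose any round $t$ such that $y_t\, w_{t-1} \cdot x_t \le (1 - \alpha)\gamma_t$. Then

$$u \cdot w_t = \frac{u \cdot w_{t-1} + \eta_t\, y_t\, u \cdot x_t}{N_t} \ge \frac{u \cdot w_{t-1} + \eta_t \gamma}{N_t}$$

and

$$N_t^2 = \|w'_t\|^2 = \|w_{t-1} + \eta_t y_t x_t\|^2 = 1 + \eta_t^2 + 2\eta_t\, y_t\, w_{t-1} \cdot x_t \le 1 + \eta_t^2 + 2(1 - \alpha)\eta_t\gamma_t.$$

The inequality holds because an update at time $t$ implies that $y_t\, w_{t-1} \cdot x_t \le (1 - \alpha)\gamma_t$. Substituting the values of $\eta_t$ and $\gamma_t$ in the last expression, we obtain $N_t^2 \le 1 + 2A/k_t$, where $A = 4/\alpha - 3$ and $k_t$ is the number of updates performed after the first $t$ time steps. Now we bound $m$ by analyzing $u \cdot w_s$ through the recursion

$$u \cdot w_s \ge \frac{u \cdot w_{s-1} + \eta_s \gamma}{\sqrt{1 + 2A/m}} = \frac{u \cdot w_{s-1}}{\sqrt{1 + 2A/m}} + \frac{\gamma}{\sqrt{m/2 + A}}.$$

Solving this recursion, while keeping in mind that $w_0 = 0$ and $u \cdot w_t = u \cdot w_{t-1}$ if no update takes place at time $t$, we obtain

$$u \cdot w_s \ge \sum_{k=1}^m \frac{\gamma}{\sqrt{k/2 + A}} \prod_{j=k+1}^m \frac{1}{\sqrt{1 + 2A/j}},$$

where for $k = m$ the product has value 1. Now,

$$-\ln \prod_{j=k+1}^m \frac{1}{\sqrt{1 + 2A/j}} = \frac{1}{2}\sum_{j=k+1}^m \ln\left(1 + \frac{2A}{j}\right) \le A \sum_{j=k+1}^m \frac{1}{j} \quad \text{(since } \ln(1 + x) \le x \text{ for all } x > -1\text{)}$$

$$\le A \int_k^m \frac{dx}{x} = A \ln\frac{m}{k}.$$

Therefore,

$$\prod_{j=k+1}^m \frac{1}{\sqrt{1 + 2A/j}} \ge \left(\frac{k}{m}\right)^A.$$

Now, since $u \cdot w_s \le 1$, we obtain

$$1 \ge \gamma \sum_{k=1}^m \frac{(k/m)^A}{\sqrt{k/2 + A}} \ge \gamma \sum_{k=1}^m \frac{(k/m)^A}{\sqrt{m/2 + A}} \ge \frac{\gamma}{m^A \sqrt{m/2 + A}} \int_0^m k^A\, dk = \frac{\gamma}{A + 1}\,\frac{m}{\sqrt{m/2 + A}}.$$

Solving for $m$ yields

$$m \le \frac{(A + 1)^2}{4\gamma^2} + \sqrt{\frac{(A + 1)^4}{16\gamma^4} + \frac{(A + 1)^2 A}{\gamma^2}}$$

$$\le \frac{(A + 1)^2}{4\gamma^2} + \frac{(A + 1)^{3/2}}{\gamma}\sqrt{\frac{A + 1}{16\gamma^2} + 1}$$

$$\le \frac{(A + 1)^2}{4\gamma^2} + \frac{(A + 1)^2}{4\gamma^2} + 2(A + 1),$$

where we used the inequality $\sqrt{x + 1} \le \sqrt{x} + 1/(2\sqrt{x})$ for $x > 0$ in the last step. Substituting our choice of $A$ yields the desired result.

To show that $w_s$ is a linear separator with margin $(1 - \alpha)\gamma$, note that

$$\gamma_s = \frac{1}{\alpha}\sqrt{\frac{8}{m}} \ge \frac{1}{\alpha}\sqrt{\frac{8}{\frac{(A + 1)^2}{2\gamma^2} + 2(A + 1)}} \ge \frac{1}{\alpha}\sqrt{\frac{8\gamma^2}{\frac{(A + 1)^2}{2} + 2(A + 1)}} \quad \text{(since } 0 < \gamma \le 1\text{)}$$

$$= \frac{\gamma}{\sqrt{1 - \alpha^2/4}} \ge \gamma.$$

Since the last update occurs at time $s$, this means that $y_t\, w_s \cdot x_t > (1 - \alpha)\gamma$ for all $t > s$. ∎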


12.4 Label Efficient Classifiers 


In Section 6.2 we looked at prediction in a “label efficient” scenario where the forecaster 
has limited access to the sequence of outcomes y1, y2, .... Using an independent random 
process for selecting the outcomes to observe, we have been able to control the regret of 
the weighted average forecasters when an a priori bound is imposed on the overall number 
of outcomes that may be observed. 

In this section we cast label efficient prediction in the model of linear classification with side information: after generating the prediction $\hat y_t = \mathrm{sgn}(\hat p_t)$ for the next label $y_t$ given the side information $x_t$, the forecaster uses randomization to decide whether to query $y_t$ or not. If $y_t$ is not queried, its value remains unknown to the forecaster, and the current classifier is not updated. It is important to remark that, as in Section 6.2, in this model the forecaster is also evaluated by counting prediction mistakes on those time steps when the true labels $y_t$ remained unknown.

We study selective sampling algorithms that use a simple randomized rule to decide whether to query the label of the current instance. This rule prescribes that the label should be obtained with probability $c/(c + |\hat p_t|)$, where $\hat p_t$ is the margin achieved by the current linear classifier on the instance, and $c > 0$ is a parameter of the algorithm acting as a scaling factor on $\hat p_t$. Note that a label is sampled with a small probability whenever the margin is large.
Unlike the approach described in Section 6.2, this rule provides no control on the number of queried labels. In fact, this number is a random variable depending, through the margin $\hat p_t$, on the interaction between the algorithm and the data sequence on which the algorithm is run. Owing to the complex nature of this interaction, the analysis fails to characterize the behavior of this random variable in terms of simple quantities related to the data sequence. However, the analysis does reveal an interesting phenomenon. In all of the label efficient algorithms we analyzed, a proper choice of the scaling factor $c$ in the randomized rule yields the same mistake bound as that achieved by the original forecaster before the introduction of the label efficient mechanism. Hence, in some sense, the randomization uses the margin information to select those labels that can be ignored without increasing (in expectation) the overall number of mistakes.

To provide some intuition on how the randomized selection rule works, consider the standard Perceptron algorithm run on a sequence $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$, where we assume $\|x_t\| = 1$ for $t = 1, \ldots, n$. For the sake of simplicity, assume that this sequence is linearly separated by a hyperplane $u \in \mathbb{R}^d$. Recall that $\hat p_t = w_{t-1} \cdot x_t$. A basic inequality controlling the hinge loss in this case (see Section 12.2, and also the proof of Theorem 12.5 below) is

$$(1 - y_t \hat p_t)_+ \le \frac{1}{2}\bigl(\|u - w_{t-1}\|^2 - \|u - w_t\|^2 + 1\bigr).$$

Now, if $y_t \hat p_t < 0$ and $|\hat p_t|$ is large, then the hinge loss $(1 - y_t \hat p_t)_+$ is also large. This in turn implies that the difference $\|u - w_{t-1}\|^2 - \|u - w_t\|^2$ must be big. This means that $\|u - w_t\|^2$ drops as $w_{t-1}$ is updated to $w_t$. So, whenever the Perceptron makes a classification mistake with a large margin value $|\hat p_t|$, the weight $w_{t-1}$ is moved by a significant amount toward the linear separator $u$. In this respect, mistakes with large margin bring more progress than mistakes with margin close to 0. On the other hand, the standard Perceptron algorithm does not take into account the information brought by $|\hat p_t|$. The basic idea underlying the label efficient method is a way to incorporate this information into the prediction by using the size of $|\hat p_t|$ to trade off a potential progress with a spared label.

We now formally define and analyze a label efficient version of the Perceptron algorithm. 
Similar arguments can be developed to prove analogous bounds for other conservative 
gradient-based forecasters analyzed in this chapter (see the exercises). 

As our forecasters are randomized, we adopt the terminology introduced in Chapter 4. The forecaster has access to a sequence $U_1, U_2, \ldots$ of i.i.d. random variables uniformly distributed in $[0, 1]$. The decision of querying the outcome at time $t$ is defined by the value of a Bernoulli random variable $Z_t$ of parameter $q_t$ (where $q_t$ is determined by $U_1, \ldots, U_{t-1}$ and by the specific selection rule used by the forecaster). To obtain a realization of $Z_t$, the forecaster assigns $Z_t = 1$ if and only if $U_t \in [0, q_t)$. The sequence of outcomes is represented by the random variables $Y_1, Y_2, \ldots$, where each $Y_t$ is measurable with respect to the $\sigma$-algebra generated by $U_1, \ldots, U_{t-1}$. This implies that $Y_t$ is determined before the value of $Z_t$ is drawn. Our results hold also when instances $x_t$ are measurable functions of $U_1, \ldots, U_{t-1}$. However, to keep the notation simple, we derive our results in the special case of arbitrary and fixed instance sequences.


THE LABEL EFFICIENT PERCEPTRON ALGORITHM
Parameter: $c > 0$.
Initialization: $w_0 = (0, \ldots, 0)$.
For each round $t = 1, 2, \ldots$

(1) observe $x_t$, set $\hat p_t = w_{t-1} \cdot x_t$, and predict with $\hat y_t = \mathrm{sgn}(\hat p_t)$;
(2) draw a Bernoulli random variable $Z_t \in \{0, 1\}$ of parameter $c/(c + |\hat p_t|)$;
(3) if $Z_t = 1$, then query label $Y_t \in \{-1, 1\}$, and let $w_t = w_{t-1} + Y_t x_t \mathbb{I}_{\{\hat y_t \ne Y_t\}}$;
(4) if $Z_t = 0$, then $w_t = w_{t-1}$.
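The algorithm in the box can be sketched as follows. This is a minimal illustration under assumptions: the random source is any object with a `random()` method returning a number in $[0, 1)$ (Python's `random.Random` by default, with an arbitrary seed), sgn(0) = +1 is an assumed convention, and the true labels are passed in for all rounds only so that mistakes on unqueried rounds can be counted for evaluation.

```python
import random


def sgn(p):
    # Assumed tie-breaking convention: sgn(0) = +1.
    return 1 if p >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def label_efficient_perceptron(examples, c, rng=None):
    """Return (final weights, number of mistakes, number of queried labels)."""
    rng = rng or random.Random(0)          # the seed is an arbitrary choice
    d = len(examples[0][0])
    w = [0.0] * d
    mistakes = 0
    queries = 0
    for x, y in examples:
        p_hat = dot(w, x)
        y_hat = sgn(p_hat)
        if y_hat != y:
            mistakes += 1                  # counted even when y is not queried
        q = c / (c + abs(p_hat))           # query probability c / (c + |p_hat|)
        if rng.random() < q:               # Z_t = 1 iff U_t lies in [0, q_t)
            queries += 1
            if y_hat != y:                 # Perceptron update on queried mistakes
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes, queries
```

Large values of `c` push the query probability toward 1, so the sketch then behaves almost like the standard Perceptron; small values of `c` spare more labels on rounds where the margin is large.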


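Theorem 12.5. If the label efficient Perceptron algorithm is run on a sequence $(x_1, Y_1), (x_2, Y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$, then for all $n \ge 1$, for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$, the expected number of mistakes satisfies

$$\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}_{\{\hat y_t \ne Y_t\}}\right] \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{X^2}{2c}\,\frac{L_{\gamma,n}(u)}{\gamma} + \frac{\|u\|^2\,(2c + X^2)^2}{8c\gamma^2},$$

where $X = \max_{t=1,\ldots,n}\|x_t\|$.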


Note that by choosing

$$c = \frac{X}{\|u\|}\sqrt{\gamma\, L_{\gamma,n}(u) + \left(\frac{X\|u\|}{2}\right)^2}$$

one recovers (in expectation) the bound shown by Theorem 12.1 (in the special case $p = 2$). However, as $c$ is an input parameter of the algorithm, this setting implies that, at the beginning of the prediction process, the algorithm needs some information on the sequence of examples. In addition, unlike the bound of Theorem 12.1 that holds simultaneously for all $\gamma$ and $u$, this refined bound can only be obtained for fixed choices of these quantities.


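Proof of Theorem 12.5. Introduce the Bernoulli random variable $M_t = \mathbb{I}_{\{\hat{y}_t \ne Y_t\}}$. We start
from the chain of inequalities

$$
\begin{aligned}
\gamma - \ell_{\gamma,t}(\mathbf{u})
&\le \mathbf{u} \cdot \bigl(-\nabla \ell_{\gamma,t}(\mathbf{w}_{t-1})\bigr) \\
&\le (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \bigl(-\nabla \ell_{\gamma,t}(\mathbf{w}_{t-1})\bigr) \\
&= D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi^*}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi^*}(\mathbf{w}_{t-1}, \mathbf{w}_t),
\end{aligned}
$$

which we used for the derivation of (12.1) in Section 12.2. This holds for any conservative
gradient-based forecaster on any time step $t$ such that $M_t = 1$.

Just like the Perceptron, the label efficient Perceptron uses the quadratic potential $\Phi(\mathbf{v}) = \Phi^*(\mathbf{v}) = \tfrac{1}{2}\|\mathbf{v}\|^2$,
and thus $D_{\Phi^*}(\mathbf{u}, \mathbf{v}) = \tfrac{1}{2}\|\mathbf{u} - \mathbf{v}\|^2$. Consider now a time step $t$ where the
label efficient Perceptron queries a label and makes a mistake. Then $Z_t = 1$, $M_t = 1$,
and $-\nabla \ell_{\gamma,t}(\mathbf{w}_{t-1}) = Y_t \mathbf{x}_t$. Hence, we may rewrite this chain of inequalities as follows:

$$
\begin{aligned}
\gamma - \ell_{\gamma,t}(\mathbf{u})
&\le Y_t\, \mathbf{u} \cdot \mathbf{x}_t \\
&= Y_t (\mathbf{u} - \mathbf{w}_{t-1} + \mathbf{w}_{t-1}) \cdot \mathbf{x}_t \\
&= Y_t\, \mathbf{w}_{t-1} \cdot \mathbf{x}_t + \tfrac{1}{2}\|\mathbf{u} - \mathbf{w}_{t-1}\|^2 - \tfrac{1}{2}\|\mathbf{u} - \mathbf{w}_t\|^2 + \tfrac{1}{2}\|\mathbf{w}_{t-1} - \mathbf{w}_t\|^2 .
\end{aligned}
$$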


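Note that this time we obtained a stronger inequality by adding and subtracting the negative
term $Y_t \mathbf{w}_{t-1} \cdot \mathbf{x}_t = Y_t \hat{p}_t$. The additional term provided by this more careful analysis is the
key to obtaining the final result.

Using $Y_t \hat{p}_t \le 0$ and replacing $\mathbf{u}$ with $\alpha\mathbf{u}$ for $\alpha > 0$, we obtain the inequality

$$
(\alpha\gamma + |\hat{p}_t|) M_t Z_t
\le \alpha\, \ell_{\gamma,t}(\mathbf{u}) + \tfrac{1}{2}\|\alpha\mathbf{u} - \mathbf{w}_{t-1}\|^2 - \tfrac{1}{2}\|\alpha\mathbf{u} - \mathbf{w}_t\|^2 + \tfrac{1}{2}\|\mathbf{w}_{t-1} - \mathbf{w}_t\|^2
$$

that holds for all time steps $t$. Indeed, if $M_t Z_t = 0$ the inequality still holds because
$\alpha\, \ell_{\gamma,t}(\mathbf{u}) \ge 0$ and $\mathbf{w}_{t-1} = \mathbf{w}_t$. Summing for $t = 1, \dots, n$, we get

$$
\sum_{t=1}^n (\alpha\gamma + |\hat{p}_t|) M_t Z_t
\le \alpha L_{\gamma,n}(\mathbf{u}) + \frac{\alpha^2}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{t=1}^n \|\mathbf{w}_{t-1} - \mathbf{w}_t\|^2,
$$

where $\alpha^2 \|\mathbf{u}\|^2 = \|\alpha\mathbf{u} - \mathbf{w}_0\|^2$ and we dropped $-\|\alpha\mathbf{u} - \mathbf{w}_n\|^2/2$. Finally, since $M_t Z_t = 0$
implies $\|\mathbf{w}_{t-1} - \mathbf{w}_t\| = 0$, using $\|\mathbf{w}_{t-1} - \mathbf{w}_t\|^2 \le X^2$ we get

$$
\sum_{t=1}^n \left(\alpha\gamma + |\hat{p}_t| - \frac{X^2}{2}\right) M_t Z_t
\le \alpha L_{\gamma,n}(\mathbf{u}) + \frac{\alpha^2}{2}\|\mathbf{u}\|^2 .
$$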
t=1 


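Now choose $\alpha = (c + X^2/2)/\gamma$ for some $c > 0$ to be determined. The above inequality
then becomes

$$
\sum_{t=1}^n (c + |\hat{p}_t|) M_t Z_t
\le \frac{c\, L_{\gamma,n}(\mathbf{u})}{\gamma} + \frac{X^2 L_{\gamma,n}(\mathbf{u})}{2\gamma} + \frac{\|\mathbf{u}\|^2 (2c + X^2)^2}{8\gamma^2} .
$$

We now take expectations on both sides. Note that, by definition of the algorithm, $\mathbb{E}_t Z_t = c/(c + |\hat{p}_t|)$,
where we use $\mathbb{E}_t$ to indicate conditional expectation given $U_1, \dots, U_{t-1}$. Also,
$M_t$ and $\hat{p}_t$ are measurable with respect to the σ-algebra generated by $U_1, \dots, U_{t-1}$. Thus
we get

$$
\mathbb{E}\left[\sum_{t=1}^n (c + |\hat{p}_t|) M_t Z_t\right]
= \mathbb{E}\left[\sum_{t=1}^n (c + |\hat{p}_t|) M_t\, \mathbb{E}_t Z_t\right]
= \mathbb{E}\left[\sum_{t=1}^n c\, M_t\right].
$$

Dividing both sides by $c$, we arrive at the claimed inequality

$$
\mathbb{E}\left[\sum_{t=1}^n M_t\right]
\le \frac{L_{\gamma,n}(\mathbf{u})}{\gamma} + \frac{X^2 L_{\gamma,n}(\mathbf{u})}{2c\gamma} + \frac{\|\mathbf{u}\|^2 (2c + X^2)^2}{8c\gamma^2} .
$$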

Figure 12.3. A set of labeled instances $x_t \in \mathbb{R}$ is shown on the abscissa as empty circles (label $-1$)
and filled circles (label $+1$). This set is not linearly separable in $\mathbb{R}$. However, by mapping each $x_t \in \mathbb{R}$
via $\phi(x) = (x, 1 + x\sqrt{2} + x^2)$ we obtain a linearly separable set in $\mathbb{R}^2$. The coefficient $\sqrt{2}$ is chosen
so that inner products $\phi(x) \cdot \phi(x')$ between mapped instances can be computed using the polynomial
kernel function $K(x, x') = (1 + x x')^2$.


12.5 Kernel-Based Classifiers 


Kernel functions are an elegant way of turning a linear forecaster into a nonlinear one with
a reasonable computational cost. As a motivating example, consider the following simple
reduction from quadratic classifiers in $\mathbb{R}^2$ to linear classifiers in $\mathbb{R}^6$. In $\mathbb{R}^2$, a quadratic
classifier $f : \mathbb{R}^2 \to \{-1, 1\}$ is defined by

$$
f(x_1, x_2) = \mathrm{sgn}\bigl(p(x_1, x_2)\bigr),
$$

where $p(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$ is any second-degree
polynomial in the variables $x_1$ and $x_2$. The decision surface of $f$ is the set of points
$(x_1, x_2) \in \mathbb{R}^2$ satisfying the equation $p(x_1, x_2) = 0$. The decision surface of a linear classifier
is a hyperplane, whereas for quadratic classifiers the decision surface is a conic (the
family of curves to which ellipses, parabolas, and hyperbolas belong). To learn a quadratic
classifier with a linear forecaster, it is enough to observe that $p(x_1, x_2)$ of the above form
can be written as $\mathbf{w} \cdot \mathbf{x}'$ for $\mathbf{w} = (w_0, w_1, \dots, w_5)$ and $\mathbf{x}' = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$. Thus,
we can transform each instance $\mathbf{x}_t = (x_{1,t}, x_{2,t})$ via the mapping

$$
\phi(x_{1,t}, x_{2,t}) = \bigl(1, x_{1,t}, x_{2,t}, x_{1,t} x_{2,t}, x_{1,t}^2, x_{2,t}^2\bigr) = \mathbf{x}'_t
$$

and then run the linear forecaster on the transformed instances $\mathbf{x}'_t \in \mathbb{R}^6$ instead of the
original instances $\mathbf{x}_t \in \mathbb{R}^2$ (see Figure 12.3 for a 1-dimensional illustration). The vector $\mathbf{x}'$
is often called a feature vector, and in the example considered, $\mathbb{R}^6$ plays the role of the
feature space.
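
The reduction just described can be illustrated with a few lines of Python. The sketch below is not taken from the text: the names `phi` and `quadratic_classifier`, the example weight vector for the unit circle, and the convention sgn(0) = +1 are assumptions made for the illustration.

```python
def phi(x1, x2):
    """Feature map from R^2 to R^6: all monomials of degree at most 2."""
    return (1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2)


def quadratic_classifier(w, x1, x2):
    """Linear classifier in feature space = quadratic classifier in R^2."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    return 1 if score >= 0 else -1    # sgn(0) taken as +1 (assumption)


# Hypothetical weights encoding p(x1, x2) = x1^2 + x2^2 - 1, whose
# decision surface is the unit circle (a conic).
w_circle = (-1.0, 0.0, 0.0, 0.0, 1.0, 1.0)
inside = quadratic_classifier(w_circle, 0.5, 0.5)
outside = quadratic_classifier(w_circle, 2.0, 0.0)
```

The linear forecaster only ever sees the six-dimensional vectors returned by `phi`; the curvature of the decision surface comes entirely from the feature map.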

This simple trick can be easily generalized to learn any $k$th-degree polynomial decision
surface in $\mathbb{R}^d$. However, the computational cost of implementing the mapping $\phi$, even for $k$
moderately large, is too high. In fact, $\binom{d+k}{k}$ coefficients are needed to represent a $k$th-degree
polynomial surface in $\mathbb{R}^d$, implying that we have to run our linear forecaster on instances
of dimension exponentially large in $k$.
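
A quick computation makes the growth of the feature-space dimension concrete. The helper name `num_coefficients` is hypothetical; it simply evaluates the binomial coefficient counting the monomials of degree at most $k$ in $d$ variables.

```python
import math


def num_coefficients(d, k):
    """Number of monomials of degree at most k in d variables."""
    return math.comb(d + k, k)


# d = 2, k = 2 gives the six features of the quadratic example.
dims = {k: num_coefficients(100, k) for k in (1, 2, 3, 5)}
```

Already for $d = 100$ and $k = 5$ the explicit feature vector has tens of millions of components, which is why the explicit mapping is avoided.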

Computational problems nearly disappear if the classification of $\mathbf{x}_t$, at an arbitrary time
$t$, can be computed using only inner products between instances. For example, the classifier
computed by the Perceptron algorithm at time $t$ can be written in the form

$$
f_t(\mathbf{x}) = \mathrm{sgn}\Bigl(\sum_i \alpha_i\, y_{t_i}\, \mathbf{x}_{t_i} \cdot \mathbf{x}\Bigr),
$$

where $\alpha_i \in \mathbb{R}$ and the sum ranges over a subset of the instance sequence $\mathbf{x}_1, \dots, \mathbf{x}_{t-1}$.
Now suppose the Perceptron is run on the transformed instances $\mathbf{x}'_t = \phi(\mathbf{x}_t)$, where we have
rewritten the $\phi$ of our initial example as $\phi(x_1, x_2) = (1, x_1\sqrt{2}, x_2\sqrt{2}, x_1 x_2\sqrt{2}, x_1^2, x_2^2)$. Note
that the introduction of the scaling coefficients $\sqrt{2}$ makes no difference for the learning
problem faced by the forecaster. Then, as $\phi(\mathbf{x}_t) \cdot \phi(\mathbf{x}) = (1 + \mathbf{x}_t \cdot \mathbf{x})^2$, we can avoid the
computation of any $\phi(\mathbf{x}_t)$. Indeed,

$$
f_t(\mathbf{x}) = \mathrm{sgn}\Bigl(\sum_i \alpha_i\, y_{t_i}\, \phi(\mathbf{x}_{t_i}) \cdot \phi(\mathbf{x})\Bigr)
= \mathrm{sgn}\Bigl(\sum_i \alpha_i\, y_{t_i}\, (1 + \mathbf{x}_{t_i} \cdot \mathbf{x})^2\Bigr).
$$

In general, if $\phi$ maps $\mathbf{x} \in \mathbb{R}^d$ to $\phi(\mathbf{x}) = \mathbf{x}'$ whose components are all the monomials
of a $k$th-degree polynomial in the variables $\mathbf{x}$ (with suitable scaling coefficients), then
$\phi(\mathbf{x}_t) \cdot \phi(\mathbf{x}) = (1 + \mathbf{x}_t \cdot \mathbf{x})^k$. Hence, the forecaster can learn a polynomial classifier without
ever explicitly computing the coefficients of the polynomial curve.
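
The identity between the explicit inner product in feature space and the polynomial kernel can be checked numerically for the quadratic example. The function names `phi_scaled` and `poly_kernel` and the test points are assumptions made for this sketch.

```python
import math


def phi_scaled(x1, x2):
    """Scaled feature map (1, x1*sqrt2, x2*sqrt2, x1*x2*sqrt2, x1^2, x2^2)."""
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 ** 2, x2 ** 2)


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_kernel(u, v, k=2):
    """Polynomial kernel (1 + u.v)^k, computed without any feature map."""
    return (1.0 + dot(u, v)) ** k


u, v = (0.5, -1.0), (2.0, 0.25)
explicit = dot(phi_scaled(*u), phi_scaled(*v))   # six multiplications in R^6
implicit = poly_kernel(u, v)                     # one inner product in R^2
```

Both quantities agree up to floating-point rounding, but the kernel evaluation costs time proportional to $d$ rather than to the dimension of the feature space.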

Note that saying that a forecaster manipulates the transformed instances $\phi(\mathbf{x})$ using
only inner products implies that the computation performed by the forecaster is invariant
to transformations that map the instance sequence $(\mathbf{x}_1, \mathbf{x}_2, \dots)$ to $(A\mathbf{x}_1, A\mathbf{x}_2, \dots)$,
where $A$ performs a change between two orthonormal bases. To see this, note that
$(A\mathbf{u})^\top (A\mathbf{v}) = \mathbf{u}^\top A^\top A \mathbf{v} = \mathbf{u}^\top \mathbf{v}$. Such forecasters are sometimes called rotationally invariant.
Unfortunately, not all gradient-based forecasters for classification are rotationally
invariant. In particular, among the linear forecasters studied in this chapter, only the Perceptron
and the second-order Perceptron have this property (see Exercise 12.12).

In view of extending this approach to surfaces that go beyond polynomials, we investigate
the conditions guaranteeing that a symmetric function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ has the property
$K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle$ for all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ and for some $\phi$ mapping $\mathbb{R}^d$ to a Hilbert space (we
use $\langle \cdot, \cdot \rangle$ to denote the inner product in this space). We call any such function $K$ a kernel.
Note that we changed the range of $\phi$ from a finite-dimensional euclidean space to a Hilbert
space. In this space, our transformed instances $\phi(\mathbf{x})$ are vectors with possibly an infinite
number of components. This makes it possible, for example, to learn a certain class of infinite-degree
polynomial decision curves. For reasons that are made clear in the proof of the following
result, the Hilbert space $\mathcal{H}$ associated with a kernel function is called a reproducing kernel
Hilbert space.

It turns out that a simple characterization of kernels exists. 

Theorem 12.6. A symmetric function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if and only if for all
$n \in \mathbb{N}$ and for all $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ the $n \times n$ matrix $K$ with elements $K(\mathbf{x}_i, \mathbf{x}_j)$ is positive
semidefinite.


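Proof. Assume first that $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is such that $K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle$. Fix any
positive integer $n \in \mathbb{N}$, choose $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ arbitrarily, and let $K$ be the associated
matrix. Then, for all $\mathbf{u} \in \mathbb{R}^n$,

$$
\begin{aligned}
\mathbf{u}^\top K \mathbf{u}
&= \sum_{i,j=1}^n K(\mathbf{x}_i, \mathbf{x}_j)\, u_i u_j \\
&= \sum_{i,j=1}^n \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\, u_i u_j \\
&= \Bigl\langle \sum_{i=1}^n \phi(\mathbf{x}_i) u_i,\ \sum_{j=1}^n \phi(\mathbf{x}_j) u_j \Bigr\rangle \\
&= \Bigl\| \sum_{i=1}^n \phi(\mathbf{x}_i) u_i \Bigr\|^2 \ge 0 .
\end{aligned}
$$

Hence $K$ is positive semidefinite.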

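Assume now $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is such that, for any choice of $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$, the
resulting kernel matrix is positive semidefinite. Introduce the linear space $\mathcal{V}$ of functions
$f : \mathbb{R}^d \to \mathbb{R}$ defined by

$$
f(\cdot) = \sum_{i=1}^n \alpha_i K(\mathbf{x}_i, \cdot) \qquad \text{where } n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ \mathbf{x}_i \in \mathbb{R}^d,\ i = 1, \dots, n,
$$

where we set $\alpha(f + g)(\mathbf{x}) = \alpha f(\mathbf{x}) + \alpha g(\mathbf{x})$ for any $\alpha \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^d$. We now make $\mathcal{V}$
an inner product space. Introduce the operator $\langle \cdot, \cdot \rangle$ such that, for any two $f, g \in \mathcal{V}$ defined
by

$$
f(\cdot) = \sum_{i=1}^m \alpha_i K(\mathbf{u}_i, \cdot) \qquad \text{and} \qquad g(\cdot) = \sum_{j=1}^n \beta_j K(\mathbf{v}_j, \cdot),
$$

we have

$$
\langle f, g \rangle = \sum_{i=1}^m \sum_{j=1}^n \alpha_i \beta_j K(\mathbf{u}_i, \mathbf{v}_j).
$$

Clearly, $\langle \cdot, \cdot \rangle$ defined in this way is real valued, symmetric, and bilinear. In addition, for
all $f \in \mathcal{V}$,

$$
\langle f, f \rangle = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(\mathbf{u}_i, \mathbf{u}_j) = \boldsymbol{\alpha}^\top K \boldsymbol{\alpha} \ge 0
$$

because $K$ is positive semidefinite by assumption. Thus, to verify that $\langle \cdot, \cdot \rangle$ is indeed an
inner product on $\mathcal{V}$, we just have to show that $\langle f, f \rangle = 0$ implies $f = 0$. To see this, first
note that

$$
\langle f, K(\mathbf{x}, \cdot) \rangle = \sum_{i=1}^m \alpha_i K(\mathbf{u}_i, \mathbf{x}) = f(\mathbf{x}) \qquad \text{(reproducing property)},
$$

where the first equality follows from the definition of $\langle \cdot, \cdot \rangle$ by taking $g(\cdot) = K(\mathbf{x}, \cdot)$. Hence,
$f(\mathbf{x})^2 = \langle f, K(\mathbf{x}, \cdot) \rangle^2 \le \langle f, f \rangle K(\mathbf{x}, \mathbf{x})$ by the Cauchy–Schwarz inequality, and this yields
the desired implication. Thus $\mathcal{V}$ endowed with $\langle \cdot, \cdot \rangle$ is an inner product space.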

To make $\mathcal{V}$ into a complete Hilbert space, we introduce in $\mathcal{V}$ a norm defined by
$\|f - g\| = \sqrt{\langle f - g, f - g \rangle}$.

By Lemma 12.1, all Cauchy sequences in $\mathcal{V}$ have a pointwise limit. Let $\mathcal{H}$ be the set
obtained by adding to $\mathcal{V}$ all the functions $g$ that are pointwise limits of Cauchy sequences
with respect to this norm. For any $f, g \in \mathcal{H}$, define

$$
\langle f, g \rangle_{\mathcal{H}} = \lim_{n,m \to \infty} \langle f_m, g_n \rangle \qquad \text{and} \qquad \|f\|_{\mathcal{H}} = \lim_{m \to \infty} \|f_m\|,
$$

where $f_1, f_2, \dots$ and $g_1, g_2, \dots$ are Cauchy sequences in $\mathcal{V}$ with pointwise limits $f$ and $g$,
respectively. It is easy to check that $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is well defined (i.e., independent of the choice
of the sequences $f_m$ and $g_n$ converging pointwise to $f$ and $g$) and that it is an inner product
in $\mathcal{H}$. It is also easy to see that $\mathcal{H}$ is a complete space (with respect to $\|\cdot\|_{\mathcal{H}}$) in which $\mathcal{V}$ is
dense. Hence $\mathcal{H}$ is a Hilbert space.

To conclude the proof, we define the mapping $\phi : \mathbb{R}^d \to \mathcal{H}$ by $\phi(\mathbf{x}) = K(\mathbf{x}, \cdot)$. Then,
the reproducing property ensures that $K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle$. □


Note that the identity $\phi(\mathbf{x}) = K(\mathbf{x}, \cdot)$ provides a representation of the mapping $\phi$ directly
in terms of the kernel function $K$.
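
The characterization of Theorem 12.6 suggests a simple numerical sanity check: build the kernel matrix on a few points and evaluate the quadratic form $\mathbf{u}^\top K \mathbf{u}$. The sketch below is illustrative only; the helper names, the random points, and the counterexample `neg_inner` are assumptions. Evaluating the form at random vectors can refute positive semidefiniteness but cannot prove it.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_kernel(u, v, k=3):
    """Polynomial kernel (1 + u.v)^k."""
    return (1.0 + dot(u, v)) ** k


def neg_inner(u, v):
    """Symmetric but NOT a kernel: K(x, x) = -||x||^2 is negative."""
    return -dot(u, v)


def kernel_matrix(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(matrix, u):
    """u^T K u, which must be nonnegative for every u if K is a kernel."""
    n = len(u)
    return sum(matrix[i][j] * u[i] * u[j] for i in range(n) for j in range(n))


rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
K_poly = kernel_matrix(poly_kernel, points)
forms = [quadratic_form(K_poly, [rng.uniform(-1, 1) for _ in points])
         for _ in range(200)]
bad = quadratic_form(kernel_matrix(neg_inner, [(1.0, 0.0)]), [1.0])
```

For the polynomial kernel every sampled quadratic form is nonnegative (up to rounding), whereas a single point already exposes `neg_inner` as failing the condition.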


Remark 12.1. In the proof of Theorem 12.6 we obtain the same characterization when
$\mathbb{R}^d$ is replaced with an arbitrary set $\mathcal{S}$. Hence, kernels may be more generally defined as
functions $K : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ where no assumptions are imposed on $\mathcal{S}$ (e.g., $\mathcal{S}$ can be a set of
combinatorial structures such as sequences, trees, or graphs). Since any kernel $K$ defines a
metric $d$ in $\mathcal{H}$ by

$$
d(s, s') = \|\phi(s) - \phi(s')\| = \sqrt{K(s, s) + K(s', s') - 2K(s, s')},
$$

we may view a kernel as a way to embed an arbitrary set of objects in a metric space.

We now state and prove Lemma 12.1, which we used in the proof of Theorem 12.6.

Lemma 12.1. For any sequence $(f_1, f_2, \dots)$ of elements of $\mathcal{V}$, if

$$
\lim_{n \to \infty} \sup_{m > n} \|f_m - f_n\| = 0
$$

(i.e., the sequence is a Cauchy sequence), then $g = \lim_{n \to \infty} f_n$, defined by $g(\mathbf{x}) = \lim_{n \to \infty} f_n(\mathbf{x})$
for all $\mathbf{x} \in \mathbb{R}^d$, exists.


Proof. Fix $\mathbf{x}$ and consider the sequence $(f_1(\mathbf{x}), f_2(\mathbf{x}), \dots)$. Note that

$$
|f_n(\mathbf{x}) - f_m(\mathbf{x})| = |\langle f_n - f_m, K(\mathbf{x}, \cdot) \rangle| \le \|f_n - f_m\| \sqrt{K(\mathbf{x}, \mathbf{x})},
$$

where we used the reproducing property in the first step and the Cauchy–Schwarz inequality
in the second. Thus, because $(f_1, f_2, \dots)$ is a Cauchy sequence, $(f_1(\mathbf{x}), f_2(\mathbf{x}), \dots)$ also is a Cauchy
sequence. By the Cauchy criterion, every such sequence on the reals has a limit. □


Kernels of the form $K(\mathbf{u}, \mathbf{v}) = (1 + \mathbf{u} \cdot \mathbf{v})^k$ for $k \in \mathbb{N}$ are appropriately called polynomial
kernels. A closely related kernel is the homogeneous polynomial kernel $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^k$.
An infinite-dimensional extension of the homogeneous polynomial kernel is the exponential
kernel $K(\mathbf{u}, \mathbf{v}) = \exp(\mathbf{u} \cdot \mathbf{v}/\sigma^2)$ for $\sigma > 0$. The Taylor expansion

$$
\exp(\mathbf{u} \cdot \mathbf{v}) = \sum_{k=0}^{\infty} \frac{(\mathbf{u} \cdot \mathbf{v})^k}{k!}
$$

reveals that exponential kernels are linear combinations of infinitely many homogeneous
polynomial kernels, where the coefficients of the polynomials decrease exponentially with
the degree. By enforcing $\|\phi(\mathbf{x})\| = 1$ or, equivalently, $K(\mathbf{x}, \mathbf{x}) = 1$, the exponential kernel
is transformed as follows:

$$
\frac{K(\mathbf{u}, \mathbf{v})}{\sqrt{K(\mathbf{u}, \mathbf{u})\, K(\mathbf{v}, \mathbf{v})}}
= \frac{\exp(\mathbf{u} \cdot \mathbf{v}/\sigma^2)}{\sqrt{\exp(\mathbf{u} \cdot \mathbf{u}/\sigma^2)}\, \sqrt{\exp(\mathbf{v} \cdot \mathbf{v}/\sigma^2)}}
= \exp\bigl(-\|\mathbf{u} - \mathbf{v}\|^2/(2\sigma^2)\bigr).
$$

This is the gaussian kernel, widely used in pattern classification. The classifier constructed
by the Perceptron algorithm run with a gaussian kernel corresponds to a weighted mixture
of spherical gaussians with equal variance and centered on a subset of the previously seen
instances. Linear classifiers in the feature space defined by gaussian kernels have often
been called radial basis function (RBF) networks.
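
The normalization step that turns the exponential kernel into the gaussian kernel can be verified numerically. The function names and the sample vectors below are assumptions made for this sketch.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def exp_kernel(u, v, sigma):
    """Exponential kernel exp(u.v / sigma^2)."""
    return math.exp(dot(u, v) / sigma ** 2)


def normalized_exp_kernel(u, v, sigma):
    """K(u, v) / sqrt(K(u, u) K(v, v)), which enforces K(x, x) = 1."""
    return exp_kernel(u, v, sigma) / math.sqrt(
        exp_kernel(u, u, sigma) * exp_kernel(v, v, sigma))


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))


u, v, sigma = (1.0, 2.0), (0.5, -1.0), 1.5
lhs = normalized_exp_kernel(u, v, sigma)
rhs = gaussian_kernel(u, v, sigma)
```

The two values coincide up to rounding because $\mathbf{u} \cdot \mathbf{v} - \|\mathbf{u}\|^2/2 - \|\mathbf{v}\|^2/2 = -\|\mathbf{u} - \mathbf{v}\|^2/2$.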


Mistake Bounds and Computational Issues 

The mistake bounds shown in Section 12.2 extend naturally to kernels. Consider, for
instance, the second-order Perceptron run with a generic kernel function $K$ in a reproducing
kernel Hilbert space $\mathcal{H}$. Pick any sequence $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ and let
the cumulative hinge loss of any function $f \in \mathcal{H}$ on this sequence be defined by

$$
L_{\gamma,n}(f) = \sum_{t=1}^n \bigl(\gamma - y_t f(\mathbf{x}_t)\bigr)_+ .
$$


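Then the number of mistakes made by the second-order Perceptron is bounded as

$$
\sum_{t=1}^n \mathbb{I}_{\{\hat{y}_t \ne y_t\}}
\le \inf_{\gamma > 0}\ \min_{\|f\|_{\mathcal{H}} = 1}
\left( \frac{L_{\gamma,n}(f)}{\gamma} + \frac{1}{\gamma} \sqrt{\Bigl(1 + \sum_{t=1}^n f(\mathbf{x}_t)^2\Bigr) \sum_{i=1}^n \ln(1 + \lambda_i)} \right),
$$

where the numbers $\lambda_i$ are the eigenvalues of the kernel matrix with entries $K(\mathbf{x}_i, \mathbf{x}_j)$ for
$i, j = 1, \dots, n$.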

Note that if a linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$ is used, so that $f(\mathbf{x}) = \mathbf{u}^\top \mathbf{x}$ for some
$\mathbf{u} \in \mathbb{R}^d$, then the mistake bound of Theorem 12.3 (for the choice $\|\mathbf{u}\| = 1$) is recovered
exactly. To see this let $A = \mathbf{x}_1 \mathbf{x}_1^\top + \cdots + \mathbf{x}_n \mathbf{x}_n^\top$ and observe that

$$
\mathbf{u}^\top A \mathbf{u} = \sum_{t=1}^n (\mathbf{u}^\top \mathbf{x}_t)^2 .
$$

Also, the nonzero eigenvalues of the matrix $A$ coincide with the nonzero eigenvalues of the
kernel matrix.

We close this section by noting that the kernel-based version of the Perceptron uses
space $\Theta(m)$ to store a linear classifier and time $\Theta(m)$ to update it, where $m$ is the number of
mistakes made so far. The second-order Perceptron, instead, uses space $\Theta(m^2)$ for storing
and time $\Theta(m^2)$ for updating (this can be shown via simple linear algebraic identities
about the update of inverse matrices). Thus, kernel-based forecasters have space and time
requirements that grow with the number of mistakes. An interesting thread of research is the
design of principled techniques that trade off a reduction of space requirements against
a moderate increase in the number of mistakes. The label efficient analysis in Section 12.4
is an example of this approach.
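
The linear growth of storage with the number of mistakes can be seen in a minimal kernel Perceptron kept in dual form. The sketch below is an illustration under stated assumptions: the class name `KernelPerceptron`, the one-dimensional data set, the repeated passes over the data, and the convention sgn(0) = +1 are all hypothetical choices, not taken from the text.

```python
def poly_kernel(u, v, k=2):
    """Polynomial kernel (1 + u.v)^k."""
    return (1.0 + sum(ui * vi for ui, vi in zip(u, v))) ** k


class KernelPerceptron:
    """Perceptron in dual form: one stored example per mistake."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.support = []                 # list of (x, y), grows with mistakes

    def score(self, x):
        # Evaluating the classifier costs one kernel call per stored example.
        return sum(y * self.kernel(xs, x) for xs, y in self.support)

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1   # sgn(0) = +1 (assumption)

    def update(self, x, y):
        """Predict on x, then store (x, y) only if the prediction was wrong."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.support.append((x, y))
        return y_hat


# Hypothetical 1-dimensional data that is not linearly separable in R:
# label +1 far from the origin, label -1 near it.
data = [((-2.0,), 1), ((-0.5,), -1), ((0.5,), -1), ((2.0,), 1)]
model = KernelPerceptron(poly_kernel)
total_mistakes = 0
for _ in range(100):                      # repeat passes until a clean pass
    errors = sum(model.update(x, y) != y for x, y in data)
    total_mistakes += errors
    if errors == 0:
        break
```

The list `support` has exactly one entry per mistake, so both memory and prediction time scale with the number of mistakes, as stated above.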


12.6 Bibliographic Remarks 


Perceptrons, introduced by Rosenblatt [249] as an attempt to model “the capability of higher 
organisms for perceptual recognition, generalization, recall, and thinking,” are among the 
earliest examples of learning algorithms. Versions of the Perceptron convergence theorem 
were proved by Rosenblatt [250], Block [31], and Novikoff [225]. p-Norm Perceptrons 
were introduced and analyzed in the linearly separable case by Grove, Littlestone, and 
Schuurmans [133], as a special case of their quasi-additive classification algorithm (see also 
Warmuth and Jagota [305] and Kivinen and Warmuth [183]). Generalization of this analysis 
to sequences that are not linearly separable was proposed by Freund and Schapire [114], 
Gentile and Warmuth [125], and Gentile [124]. Perceptrons with dynamic tuning were 
considered by Graepel, Herbrich, and Williamson [132]. 

The Winnow algorithm was introduced by Littlestone [200] as an alternative to Perceptron.
Just as the Perceptron algorithm is the counterpart for classification of the Widrow–Hoff
rule used in regression, the version of Winnow presented here is the classification
version of the exponentiated gradient algorithm of Kivinen and Warmuth (see the biblio- 
graphic remarks in Chapter 11). Recalling the discussion at the end of Section 11.4, we 
may conclude that Winnow should perform better than Perceptron on data sequences that 
have dense instance vectors and are well approximated by sparse linear experts. In fact, 
Winnow was originally proposed for boolean side information, $\mathbf{x}_t \in \{0, 1\}^d$, and for an
expert class properly contained in the class of linear experts: the class of all monotone
$k$-literal disjunction experts. Each such expert is defined by a subset of at most $k$ coordinates,
and its prediction on $\mathbf{x}_t \in \{0, 1\}^d$ is 1 if and only if at least one of these coordinates has value
1 in $\mathbf{x}_t$. As shown by Littlestone [200], if the data sequence is perfectly classified by some
$k$-literal disjunction expert, then Winnow makes at most $O(k \ln d)$ mistakes. On the other
hand, Kivinen, Warmuth, and Auer [184] show that there are boolean data sequences of the
same type on which the Perceptron algorithm makes $\Omega(kd)$ mistakes. For extensions and
applications of the p-norm Perceptron to classification of k-literal disjunctions, see also 
Auer and Warmuth [15], Gentile [124], Littlestone [201]. 

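The second-order Perceptron was introduced by Cesa-Bianchi, Conconi, and Gentile [47],
who also studied variants using the pseudoinverse of $\mathbf{x}_1 \mathbf{x}_1^\top + \cdots + \mathbf{x}_t \mathbf{x}_t^\top$ rather
than the inverse of $I + \mathbf{x}_1 \mathbf{x}_1^\top + \cdots + \mathbf{x}_t \mathbf{x}_t^\top$.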

Forecasting strategies converging to a separating hyperplane with maximum margin 
have been proposed by several authors. A remarkable example is the Adatron of Anlauf 
and Biehl [8]. However, finite-time convergence results for approximate maximum margin
hyperplanes, such as the analysis of ALMA in Section 12.3, have been proposed only recently.
Such results include the relaxed maximum margin online algorithm of Li and Long [199]
and the margin infused relaxed algorithm of Crammer and Singer [75]. The ALMA algorithm
and Theorem 12.4 are due to Gentile [123]. Support vector machines (SVMs), an effective
classification technique originally introduced by Vapnik and Lerner [294] (under a different
name), find the maximum margin hyperplane at once by solving an optimization problem
defined over the entire sequence of examples. In their modern form, SVMs were introduced
by Boser, Guyon, and Vapnik [38] and Cortes and Vapnik [67]. See the monographs of
Cristianini and Shawe-Taylor [76], Schölkopf and Smola [262], and Vapnik [292] for
extensive accounts of the theory of SVMs.

The label efficient forecasters presented in Section 12.4 were introduced and analyzed 
by Cesa-Bianchi, Gentile, and Zaniboni [49]. However, similar techniques aimed at saving 
labels have been extensively studied in pattern recognition. See, for instance, the pioneering 
paper of Cohn, Atlas, and Ladner [66], the query by committee algorithm of Freund, Seung, 
Shamir, and Tishby [116], and the more recent approaches of Campbell, Cristianini, and 
Smola [45], Tong and Koller [289], and Bordes, Ertekin, Weston, and Bottou [35]. 

The study of reproducing kernel Hilbert spaces was developed by Aronszajn [9] in the
1940s. Kernels have been used in learning since 1964, starting with the influential
work of Aizerman, Braverman, and Rozonoer [1–3] and Bashkirov, Braverman, and
Muchnik [23] (see also Specht [277]).

However, it took almost 30 years before the potential of kernels began to be fully
understood with the paper of Boser, Guyon, and Vapnik [38]. The books of Schölkopf and
Smola [262] and Cristianini and Shawe-Taylor [77] are two excellent monographs on learning
with kernels. The proof of Theorem 12.6 is taken from Saitoh [255]. Kernel perceptrons
were considered by Freund and Schapire [114]. The kernel second-order Perceptron is due
to Cesa-Bianchi, Conconi, and Gentile [47].


12.7 Exercises 


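12.1 (Perceptron with time-varying learning rate) Extend Theorem 12.1 to prove that the Perceptron
(i.e., the p-norm Perceptron with $p = 2$) with learning rate $\lambda_t = 1/\|\mathbf{x}_t\|$ achieves, on any
sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots \in \mathbb{R}^d \times \{-1, 1\}$, and for all $\gamma > 0$ and $\mathbf{u} \in \mathbb{R}^d$, the bound

$$
\sum_{t=1}^n \mathbb{I}_{\{\hat{y}_t \ne y_t\}}
\le \frac{\hat{L}_{\gamma,n}(\mathbf{u})}{\gamma} + \left(\frac{\|\mathbf{u}\|}{\gamma}\right)^2 + \frac{\|\mathbf{u}\|}{\gamma} \sqrt{\frac{\hat{L}_{\gamma,n}(\mathbf{u})}{\gamma}},
$$

where $\hat{L}_{\gamma,n}(\mathbf{u}) = \sum_{t=1}^n \bigl(\gamma - y_t\, \mathbf{u} \cdot \mathbf{x}_t/\|\mathbf{x}_t\|\bigr)_+$ is the normalized cumulative hinge loss.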

12.2 (p-Norm perceptron for the absolute loss) By adapting the self-confident linear forecaster
with polynomial potential introduced in Section 11.5, derive a forecaster for the absolute loss
$\ell(\hat{p}, y) = \tfrac{1}{2}|\hat{p} - y|$, where $\hat{p} \in [-1, +1]$ and $y \in \{-1, 1\}$. Prove a bound on the absolute loss
of this forecaster in terms of the hinge loss $L_{\gamma,n}(\mathbf{u})$ of the best linear forecaster $\mathbf{u}$ with $q$-norm
bounded by a known constant. Set the hinge $\gamma$ to 1 (Auer, Cesa-Bianchi, and Gentile [13]).
Warning: This exercise is difficult.

12.3 (Learning r-of-k threshold functions) An r-of-k threshold function is a function
$f : \{0, 1\}^d \to \{-1, 1\}$ specified by $k$ relevant attributes indexed by $i_1, \dots, i_k \in \{1, \dots, d\}$.
On any $\mathbf{x} \in \{0, 1\}^d$, $f(\mathbf{x}) = 1$ if and only if $x_{i_1} + \cdots + x_{i_k} \ge r$. Given a sequence
$(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n) \in \{0, 1\}^d \times \{-1, 1\}$, the attribute error $A_f$ on the sequence is
$a_{f,1} + \cdots + a_{f,n}$, where $a_{f,t}$ is the minimum number of components of $\mathbf{x}_t$ that have to be changed to
ensure that $f(\mathbf{x}_t) = y_t$.

Prove a mistake bound for the p-norm Perceptron on sequences over $\{0, 1\}^d \times \{-1, 1\}$
such that the number of mistakes is bounded in terms of the attribute error $A_f$ of an arbitrary
r-of-k threshold function $f$. Investigate what happens to the bound when $p$ is set to $2 \ln(d + 1)$
(see Gentile [124]).

12.4 (Parameterized second-order Perceptron) Consider the parameterized second-order forecaster
defined using

$$
A_t = aI + \sum_{s=1}^{t-1} \mathbf{x}_s \mathbf{x}_s^\top,
$$

where $a > 0$ is a free parameter. Hence, the parameterless second-order Perceptron corresponds
to the setting $a = 1$. Prove an analog of Theorem 12.3 for this variant. Investigate
different choices of the parameter $a$. What happens to the mistake bound for $a \to \infty$?

12.5 (Second-order vs. classical Perceptron) Show that there exists a choice of $a$ such that inequality
(12.4) is satisfied when $a < 1/(2k)$, where $k$ is the number of nonzero eigenvalues of $A_n$
(Cesa-Bianchi, Conconi, and Gentile [47]).

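12.6 (Proofs via the Blackwell condition) Consider the conservative classifiers introduced in
Section 12.2. Observe that the weight vectors used by these classifiers can be equivalently
defined using $\mathbf{w}_t = \nabla \Phi(\mathbf{R}_t)$, where $\mathbf{R}_t$ is the cumulative “regret”

$$
\mathbf{R}_t = \sum_{s=1}^t \mathbf{r}_s = -\sum_{s=1}^t \nabla \ell_{\gamma,s}(\mathbf{w}_{s-1})
$$

and $\ell_{\gamma,s}(\mathbf{w}_{s-1})$ is the hinge loss $(\gamma - y_s\, \mathbf{w}_{s-1} \cdot \mathbf{x}_s)_+$. Verify that the Blackwell condition

$$
\sup_{y_t \in \{-1, 1\}} \mathbf{r}_t \cdot \nabla \Phi(\mathbf{R}_{t-1}) \le 0
$$

holds for this definition of regret. Then use Corollary 2.1 to derive the same mistake
bound shown in Theorem 12.1. Hint: Use Corollary 2.1 to upper bound $\Phi_p(\mathbf{R}_n)$ in terms
of $\sum_t \mathbb{I}_{\{\hat{y}_t \ne y_t\}}$, and then use Hölder’s inequality to show the lower bound

$$
\|\mathbf{u}\|_q \|\mathbf{R}_n\|_p \ge \gamma \sum_{t=1}^n \mathbb{I}_{\{\hat{y}_t \ne y_t\}} - \sum_{t=1}^n \ell_{\gamma,t}(\mathbf{u})
$$

for $\mathbf{u} \in \mathbb{R}^d$ arbitrary (Cesa-Bianchi and Lugosi [54]).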


12.7 (ALMA on arbitrary sequences) Suppose ALMA is run on an arbitrary sequence $(\mathbf{x}_1, y_1), \dots,
(\mathbf{x}_n, y_n), \dots \in \mathbb{R}^d \times \{-1, 1\}$. Prove a bound on the number of updates of the form


PE L,(u) x as a c& |L, u) 
y: y Y Y 


for any $\mathbf{u} \in \mathbb{R}^d$ with $\|\mathbf{u}\| = 1$ and for any $\gamma > 0$. Note: The bound does not depend on $\alpha$,
but you might have to change the constants in the definition of $\gamma_t$ and $\eta_t$ in order to prove it
(Gentile [123]).

12.8 (p-Norm ALMA) Prove a version of Theorem 12.4 using a modified p-norm Perceptron
(Gentile [123]).

12.9 (Label efficient Winnow) Adapt the proof of Theorem 12.2 to show that the label efficient
version of Winnow, querying label $Y_t$ with probability $c/(c + |\hat{p}_t|)$, and run with parameters
$\eta = 2\alpha\gamma/X_\infty^2$ and $c = (1 - \alpha)\gamma$ for some $0 < \alpha < 1$, achieves an expected number of
mistakes satisfying

$$
\mathbb{E}\left[\sum_{t=1}^n M_t\right]
\le \frac{1}{1 - \alpha}\, \frac{L_{\gamma,n}(\mathbf{u})}{\gamma} + \left(\frac{X_\infty}{\gamma}\right)^2 \frac{\ln d}{2\alpha(1 - \alpha)}
$$

for all $\mathbf{u} \in \mathbb{R}^d$ in the probability simplex (Cesa-Bianchi, Lugosi, and Stoltz [55] and
Cesa-Bianchi, Gentile, and Zaniboni [49]).


12.10 (Label efficient second order Perceptron) Adapt the proof of Theorem 12.3 to show that
the label efficient version of the second-order Perceptron, querying label $Y_t$ with probability
$c/(c + |\hat{p}_t|)$, achieves an expected number of mistakes satisfying

$$
\mathbb{E}\left[\sum_{t=1}^n M_t\right]
\le \frac{L_{\gamma,n}(\mathbf{u})}{\gamma} + \frac{c}{2\gamma^2}\bigl(\|\mathbf{u}\|^2 + \mathbf{u}^\top A_n \mathbf{u}\bigr) + \frac{1}{2c} \sum_{i=1}^d \ln(1 + \lambda_i)
$$

for any choice $c > 0$ of the input parameter and for all $\mathbf{u} \in \mathbb{R}^d$ and $\gamma > 0$.


12.11 Show by induction that a $k$th-degree surface in $\mathbb{R}^d$ is specified by $\binom{d+k}{k}$ coefficients.

12.12 (Second-order Perceptron in dual variables) Show that the second-order Perceptron classification
at time $t$ can be computed using only inner product operations between instances
$\mathbf{x}_1, \dots, \mathbf{x}_t$.


12.13 (All subsets kernel) Find an easily computable kernel for the mapping $\phi : \mathbb{R}^d \to \mathbb{R}^{2^d}$ defined
by $\phi(\mathbf{x}) = (x^A)_A$, where $x^A = \prod_{i \in A} x_i$ and $A$ ranges over all subsets of $\{1, \dots, d\}$ (Takimoto
and Warmuth [286]).

12.14 (ANOVA kernel) Consider the mapping $\phi$ such that, for any $\mathbf{x} \in \mathbb{R}^d$, $\phi(\mathbf{x}) = (x^A)_A$, where
$x^A = \prod_{i \in A} x_i$ and $A$ ranges over all subsets of $\{1, \dots, d\}$ of size at most $k$ for some fixed
$k = 1, \dots, d$. Direct computation of $\phi(\mathbf{u}) \cdot \phi(\mathbf{v})$ takes time of order $d^k$. Use dynamic programming
to show that the same computation can be performed in time $O(kd)$ (Watkins [306]).


Appendix 


In this appendix we collect some of the technical tools used in the book and not proved in 
the main text. Most of the results reproduced here are quite standard; they are here to make 
the book as self-contained as possible. Here we take a minimalist approach and stick to the 
simplest possible versions that are necessary to follow the material in the main text. This 
appendix should not be taken as an attempt at an exhaustive survey. The cited references
are merely intended to point to the original sources of the results.


A.1 Inequalities from Probability Theory 


A.1.1  Hoeffding’s Inequality 
First we offer a proof of Lemma 2.2, which states the following: 


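Lemma A.1. Let $X$ be a random variable with $a \le X \le b$. Then for any $s \in \mathbb{R}$,

$$
\ln \mathbb{E}\bigl[e^{sX}\bigr] \le s\, \mathbb{E} X + \frac{s^2 (b - a)^2}{8} .
$$

Proof. Since $\ln \mathbb{E}\bigl[e^{sX}\bigr] \le s\, \mathbb{E} X + \ln \mathbb{E}\bigl[e^{s(X - \mathbb{E} X)}\bigr]$, it suffices to show that for any random
variable $X$ with $\mathbb{E} X = 0$, $a \le X \le b$,

$$
\mathbb{E}\bigl[e^{sX}\bigr] \le e^{s^2 (b - a)^2/8} .
$$

Note that by convexity of the exponential function,

$$
e^{sx} \le \frac{x - a}{b - a}\, e^{sb} + \frac{b - x}{b - a}\, e^{sa} \qquad \text{for } a \le x \le b .
$$

Exploiting $\mathbb{E} X = 0$, and introducing the notation $p = -a/(b - a)$, we get

$$
\begin{aligned}
\mathbb{E}\, e^{sX} &\le \frac{b}{b - a}\, e^{sa} - \frac{a}{b - a}\, e^{sb} \\
&= \bigl(1 - p + p\, e^{s(b - a)}\bigr)\, e^{-ps(b - a)} \\
&\stackrel{\mathrm{def}}{=} e^{\phi(u)},
\end{aligned}
$$

where $u = s(b - a)$, and $\phi(u) = -pu + \log(1 - p + p e^u)$. But by straightforward calculation
it is easy to see that the derivative of $\phi$ is

$$
\phi'(u) = -p + \frac{p}{p + (1 - p) e^{-u}}
$$

and therefore $\phi(0) = \phi'(0) = 0$. Moreover,

$$
\phi''(u) = \frac{p(1 - p) e^{-u}}{\bigl(p + (1 - p) e^{-u}\bigr)^2} \le \frac{1}{4} .
$$

Thus, by Taylor’s theorem,

$$
\phi(u) = \phi(0) + u\, \phi'(0) + \frac{u^2}{2}\, \phi''(\theta) \le \frac{u^2}{8} = \frac{s^2 (b - a)^2}{8}
$$

for some $\theta \in [0, u]$. □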


Lemma A.1 was originally proven to derive the following result, also known as Hoeffd- 
ing’s inequality. 


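Corollary A.1. Let $X_1, \dots, X_n$ be independent real-valued random variables such that for
each $i = 1, \dots, n$ there exist some $a_i \le b_i$ such that $\mathbb{P}[a_i \le X_i \le b_i] = 1$. Then for every
$\varepsilon > 0$,

$$
\mathbb{P}\left[\sum_{i=1}^n X_i - \mathbb{E} \sum_{i=1}^n X_i \ge \varepsilon\right] \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)
$$

and

$$
\mathbb{P}\left[\sum_{i=1}^n X_i - \mathbb{E} \sum_{i=1}^n X_i \le -\varepsilon\right] \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).
$$

Proof. The proof is based on a clever application of Markov’s inequality, often referred
to as Chernoff’s technique: for any $s > 0$,

$$
\mathbb{P}\left[\sum_{i=1}^n (X_i - \mathbb{E} X_i) \ge \varepsilon\right]
\le \frac{\mathbb{E}\bigl[\exp\bigl(s \sum_{i=1}^n (X_i - \mathbb{E} X_i)\bigr)\bigr]}{\exp(s\varepsilon)}
= \frac{\prod_{i=1}^n \mathbb{E}\bigl[\exp\bigl(s (X_i - \mathbb{E} X_i)\bigr)\bigr]}{\exp(s\varepsilon)},
$$

where we used independence of the variables $X_i$. Bound the numerator using Lemma A.1
and minimize the obtained bound in $s$ to get the first inequality. The second is obtained by
symmetry. □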
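
As a concrete check of Corollary A.1, the sketch below compares the exact upper tail of a sum of fair coin flips with Hoeffding’s bound. The function names and the choice of $n = 20$ are assumptions made for the illustration.

```python
import math


def hoeffding_bound(eps, ranges):
    """exp(-2 eps^2 / sum_i (b_i - a_i)^2) for ranges [(a_1, b_1), ...]."""
    return math.exp(-2 * eps ** 2 / sum((b - a) ** 2 for a, b in ranges))


def fair_coin_upper_tail(n, k):
    """Exact P[S >= k] when S is the number of heads in n fair coin flips."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n


n = 20
ranges = [(0.0, 1.0)] * n          # each X_i takes values in [0, 1]
# Deviation eps above the mean n/2 corresponds to the event S >= n/2 + eps.
table = [(eps, fair_coin_upper_tail(n, n // 2 + eps), hoeffding_bound(eps, ranges))
         for eps in range(1, 11)]
```

Each row of `table` holds a deviation, the exact tail probability, and the bound; the bound is never violated, though it can be quite loose for small deviations.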


We close this section by a version of Corollary A.1, also due to Hoeffding [161], for 
the case when sampling is done without replacement. 

Lemma A.2. Let the set $\mathcal{A}$ consist of $N$ numbers $a_1, \dots, a_N$. Let $Z_1, \dots, Z_n$ denote a
random sample taken without replacement from $\mathcal{A}$, where $n \le N$. Denote

$$
m = \frac{1}{N} \sum_{i=1}^N a_i \qquad \text{and} \qquad c = \max_{i,j \le N} |a_i - a_j| .
$$

Then for any $\varepsilon > 0$ we have

$$
\mathbb{P}\left[\left|\frac{1}{n} \sum_{i=1}^n Z_i - m\right| \ge \varepsilon\right] \le 2 e^{-2n\varepsilon^2/c^2} .
$$

For more inequalities of this type, see Hoeffding [161] and Serfling [264].


A.1.2 Bernstein’s Inequality 
Next we present inequalities that, in certain situations, give tighter bounds than Hoeffding’s 
inequality. The first result is a simple “poissonian” inequality. 


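Lemma A.3. Let $X$ be a random variable taking values in $[0, 1]$. Then, for any $s \in \mathbb{R}$,

$$
\ln \mathbb{E}\bigl[e^{sX}\bigr] \le (e^s - 1)\, \mathbb{E} X .
$$

Proof. As in the proof of Hoeffding’s inequality, we exploit the convexity of $e^{sx}$ by
observing that for any $x \in [0, 1]$, $e^{sx} \le x e^s + (1 - x)$. Thus,

$$
\mathbb{E}\bigl[e^{sX}\bigr] \le \mathbb{E} X\, e^s + 1 - \mathbb{E} X .
$$

By the elementary inequality $1 + x \le e^x$ we have $\mathbb{E} X\, e^s + 1 - \mathbb{E} X \le e^{(e^s - 1)\mathbb{E} X}$ as
desired. □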


The next inequality is a version of Bernstein’s inequality [25]; see also Freedman [110], 
Neveu [224]. 


Lemma A.4. Let $X$ be a zero-mean random variable taking values in $(-\infty, 1]$ with variance
$\mathbb{E} X^2 = \sigma^2$. Then, for any $\eta > 0$,

$$
\ln \mathbb{E}\, e^{\eta X} \le \sigma^2 (e^\eta - 1 - \eta) .
$$

Proof. The key observation is that the function $(e^x - x - 1)/x^2$ is nondecreasing for all
$x \in \mathbb{R}$. But then, since $X \le 1$,

$$
e^{\eta X} - \eta X - 1 \le X^2 (e^\eta - \eta - 1) .
$$

Taking expected values on both sides, taking logarithms, and using $\ln(1 + x) \le x$, we
obtain the stated result. □


A simple consequence of Lemma A.4 is the following inequality. 


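Lemma A.5. Let $X$ be a random variable taking values in $[0, 1]$. Let $\sigma = \sqrt{\mathbb{E} X^2 - (\mathbb{E} X)^2}$.
Then for any $\eta > 0$,

$$
\ln \mathbb{E}\bigl[e^{-\eta(X - \mathbb{E} X)}\bigr] \le \sigma^2 (e^\eta - 1 - \eta) \le \mathbb{E} X (1 - \mathbb{E} X)(e^\eta - 1 - \eta) .
$$

Proof. The first inequality is a direct consequence of Lemma A.4. The second inequality
follows by noting that, since $X \in [0, 1]$,

$$
\sigma^2 = \mathbb{E} X^2 - (\mathbb{E} X)^2 \le \mathbb{E} X - (\mathbb{E} X)^2 = \mathbb{E} X (1 - \mathbb{E} X) . \qquad \Box
$$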


Using Lemma A.4 together with Chernoff’s technique as in the proof of Corollary A.1, it 
now is easy to deduce the following result. 


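Corollary A.2 (Bennett’s inequality). Let $X_1, \dots, X_n$ be independent real-valued random
variables with zero mean, and assume that $X_i \le 1$ with probability 1. Let

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^n \mathbb{E} X_i^2 .
$$

Then for any $t > 0$,

$$
\mathbb{P}\left[\sum_{i=1}^n X_i > t\right] \le \exp\left(-n\sigma^2\, h\!\left(\frac{t}{n\sigma^2}\right)\right),
$$

where $h(u) = (1 + u) \log(1 + u) - u$ for $u \ge 0$.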


The message of this inequality is perhaps best seen if we do some further bounding.
Applying the elementary inequality $h(u) \ge u^2/(2 + 2u/3)$, $u \ge 0$ (which may be seen by
comparing the derivatives of both sides), we obtain a classical inequality of Bernstein [25].


Corollary A.3 (Bernstein’s inequality). Under the conditions of the previous theorem, for
any $\varepsilon > 0$,

$$
\mathbb{P}\left[\frac{1}{n} \sum_{i=1}^n X_i > \varepsilon\right] \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2 + 2\varepsilon/3}\right).
$$
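
To illustrate when the variance-sensitive bounds improve on Hoeffding’s inequality, the sketch below evaluates the three bounds for zero-mean variables taking values in $[-1, 1]$. The function names and the numerical values ($n = 100$, variance $0.01$, $\varepsilon = 0.1$) are assumptions made for the example.

```python
import math


def bennett_bound(t, n, var):
    """Bound of Corollary A.2 on P[sum_i X_i > t]."""
    u = t / (n * var)
    h = (1 + u) * math.log(1 + u) - u
    return math.exp(-n * var * h)


def bernstein_bound(eps, n, var):
    """Bound of Corollary A.3 on P[(1/n) sum_i X_i > eps]."""
    return math.exp(-n * eps ** 2 / (2 * var + 2 * eps / 3))


def hoeffding_bound_pm1(eps, n):
    """Corollary A.1 for X_i in [-1, 1] and a deviation n*eps of the sum."""
    return math.exp(-2 * (n * eps) ** 2 / (n * 2 ** 2))


n, var, eps = 100, 0.01, 0.1
bennett = bennett_bound(n * eps, n, var)
bernstein = bernstein_bound(eps, n, var)
hoeffding = hoeffding_bound_pm1(eps, n)
```

With a variance this small, the Bennett and Bernstein bounds are several orders of magnitude smaller than the Hoeffding bound, which only uses the range of the variables.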


A.1.3 Hoeffding–Azuma Inequality and Related Results
The following extension of Hoeffding’s inequality to bounded martingale difference 
sequences is simple and useful. 

A sequence of random variables $V_1, V_2, \dots$ is a martingale difference sequence with
respect to the sequence of random variables $X_1, X_2, \dots$ if, for every $i > 0$, $V_i$ is a function
of $X_1, \dots, X_i$, and

$$
\mathbb{E}[V_{i+1} \mid X_1, \dots, X_i] = 0 \qquad \text{with probability 1.}
$$


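Lemma A.6. Let $V_1, V_2, \dots$ be a martingale difference sequence with respect to some
sequence $X_1, X_2, \dots$ such that $V_i \in [A_i, A_i + c_i]$ for some random variable $A_i$, measurable
with respect to $X_1, \dots, X_{i-1}$, and a positive constant $c_i$. If $S_k = \sum_{i=1}^k V_i$, then for any
$s > 0$,

$$
\mathbb{E}\bigl[e^{s S_n}\bigr] \le e^{(s^2/8) \sum_{i=1}^n c_i^2} .
$$

Proof.

$$
\begin{aligned}
\mathbb{E}\bigl[e^{s S_n}\bigr]
&= \mathbb{E}\Bigl[e^{s S_{n-1}}\, \mathbb{E}\bigl[e^{s V_n} \mid X_1, \dots, X_{n-1}\bigr]\Bigr] \\
&\le \mathbb{E}\Bigl[e^{s S_{n-1}}\, e^{s^2 c_n^2/8}\Bigr] \\
&= e^{s^2 c_n^2/8}\, \mathbb{E}\bigl[e^{s S_{n-1}}\bigr],
\end{aligned}
$$

where we applied Lemma A.1. The desired inequality is obtained by iterating the argument. □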


Just as in the case of Corollary A.1, we obtain the following corollary. 


Lemma A.7. Let $V_1, V_2, \dots$ be a martingale difference sequence with respect to some
sequence $X_1, X_2, \dots$ such that $V_i \in [A_i, A_i + c_i]$ for some random variable $A_i$, measurable
with respect to $X_1, \dots, X_{i-1}$, and a positive constant $c_i$. If $S_n = \sum_{i=1}^n V_i$, then for any
$t > 0$,

$$
\mathbb{P}[S_n > t] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\right)
$$

and

$$
\mathbb{P}[S_n < -t] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\right).
$$

In fact, as noted in [161], the following “maximal” version of Lemma A.7 also holds:

$$
\mathbb{P}\left[\max_{i \le n} S_i > t\right] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\right).
$$


We also need the following “Bernstein-like” improvement that takes variance information 
into account (see Freedman [110]). The proof, which we omit, is an extension of the 
independent case just shown. 


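Lemma A.8 (Bernstein’s inequality for martingales). Let $X_1, \ldots, X_n$ be a bounded
martingale difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and with $|X_i| \le K$.
Let

\[
  S_i = \sum_{j=1}^i X_j
\]

be the associated martingale. Denote the sum of the conditional variances by

\[
  \Sigma_n^2 = \sum_{t=1}^n \mathbb{E}\left[ X_t^2 \mid \mathcal{F}_{t-1} \right].
\]

Then for all constants $t, v > 0$,

\[
  \mathbb{P}\left[ \max_{i=1,\ldots,n} S_i > t \ \text{ and } \ \Sigma_n^2 \le v \right]
  \le \exp\left( -\frac{t^2}{2(v + Kt/3)} \right),
\]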
and therefore,

\[
  \mathbb{P}\left[ \max_{i=1,\ldots,n} S_i > \sqrt{2vt} + \frac{2}{3}Kt \ \text{ and } \ \Sigma_n^2 \le v \right]
  \le e^{-t}.
\]

A.1.4 Khinchine’s Inequality 
Recall Lemma 8.2: 


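Lemma A.9. Let $a_1, \ldots, a_n$ be real numbers, and let $\sigma_1, \ldots, \sigma_n$ be i.i.d. sign variables
with $\mathbb{P}[\sigma_1 = 1] = \mathbb{P}[\sigma_1 = -1] = 1/2$. Then

\[
  \mathbb{E}\left| \sum_{i=1}^n a_i \sigma_i \right| \ge \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^n a_i^2}.
\]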
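Here we give a short and elegant proof (with a suboptimal constant $1/\sqrt{3}$ instead of $1/\sqrt{2}$)
due to Littlewood [204]. First note that for any random variable $X$ with finite fourth moment,

\[
  \mathbb{E}|X| \ge \frac{\left( \mathbb{E} X^2 \right)^{3/2}}{\left( \mathbb{E} X^4 \right)^{1/2}}.
\]

Indeed, by Hölder’s inequality,

\[
  \mathbb{E} X^2 = \mathbb{E}\left[ |X|^{2/3} |X|^{4/3} \right]
  \le \left( \mathbb{E}|X| \right)^{2/3} \left( \mathbb{E} X^4 \right)^{1/3}.
\]

Applying this inequality for $X = \sum_{i=1}^n a_i \sigma_i$ gives

\[
  \mathbb{E}\left| \sum_{i=1}^n a_i \sigma_i \right|
  \ge \frac{\left( \sum_{i=1}^n a_i^2 \right)^{3/2}}{\sqrt{\sum_{i=1}^n a_i^4 + 3 \sum_{i \ne j} a_i^2 a_j^2}}
  \ge \frac{1}{\sqrt{3}} \sqrt{\sum_{i=1}^n a_i^2},
\]

where we used $\sum_{i=1}^n a_i^4 + 3 \sum_{i \ne j} a_i^2 a_j^2 \le 3 \left( \sum_{i=1}^n a_i^2 \right)^2$.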
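For small $n$ the left-hand side of Khinchine’s inequality can be computed exactly by enumerating all sign vectors. The sketch below (with helper names of our own choosing) does this; with $a = (1, 1)$ it shows that the constant $1/\sqrt{2}$ cannot be improved.

```python
import itertools
import math

def mean_abs_sign_sum(a):
    """Exact E|sum_i a_i sigma_i| by enumerating all 2^n sign vectors."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=len(a)):
        total += abs(sum(s * x for s, x in zip(signs, a)))
    return total / 2 ** len(a)

def l2_norm(a):
    """Euclidean norm sqrt(sum_i a_i^2)."""
    return math.sqrt(sum(x * x for x in a))

for a in [(1.0, 1.0), (3.0, -1.0, 2.0)]:
    # Left-hand side, then the bounds with constants 1/sqrt(2) and 1/sqrt(3).
    print(mean_abs_sign_sum(a), l2_norm(a) / math.sqrt(2.0), l2_norm(a) / math.sqrt(3.0))
```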


A.1.5 Slud’s Inequality 
Here we recall, without proof, an inequality due to Slud [272] between binomial tails and 
their approximating normals. 


Lemma A.10. Let $B$ be a binomial $(n, p)$ random variable with $p \le 1/2$. Then for
$n(1 - p) \ge k \ge np$,

\[
  \mathbb{P}[B \ge k] \ge \mathbb{P}\left[ N \ge \frac{k - np}{\sqrt{np(1 - p)}} \right],
\]

where $N$ is a standard normal random variable.
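Slud’s inequality is easy to examine numerically, since both tails are available in closed form. The following sketch uses only the standard library; the function names are ours, and the normal tail is expressed through the complementary error function.

```python
import math

def binomial_upper_tail(n, p, k):
    """Exact P[B >= k] for B ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def normal_upper_tail(z):
    """P[N >= z] for a standard normal N."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def slud_lower_bound(n, p, k):
    """Right-hand side of Lemma A.10, meaningful for p <= 1/2 and np <= k <= n(1 - p)."""
    return normal_upper_tail((k - n * p) / math.sqrt(n * p * (1 - p)))

n, p = 20, 0.3
for k in range(6, 15):  # integers k with np <= k <= n(1 - p)
    print(k, binomial_upper_tail(n, p, k), slud_lower_bound(n, p, k))
```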
A.1.6 A Simple Limit Theorem 


Lemma A.11. Let $\{Z_{i,t}\}$ be i.i.d. Rademacher random variables ($i = 1, \ldots, N$; $t =
1, 2, \ldots$) with distribution $\mathbb{P}[Z_{i,t} = -1] = \mathbb{P}[Z_{i,t} = 1] = 1/2$, and let $G_1, \ldots, G_N$ be
independent standard normal random variables. Then

\[
  \lim_{n \to \infty} \mathbb{E}\left[ \max_{i=1,\ldots,N} \frac{1}{\sqrt{n}} \sum_{t=1}^n Z_{i,t} \right]
  = \mathbb{E}\left[ \max_{i=1,\ldots,N} G_i \right].
\]


Proof. Define the $N$-vector $X_n = (X_{n,1}, \ldots, X_{n,N})$ of components

\[
  X_{n,i} \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{n}} \sum_{t=1}^n Z_{i,t}, \qquad i = 1, \ldots, N.
\]

By the “Cramér–Wold device” (see, e.g., Billingsley [27, p. 48]), the sequence of vectors
$\{X_n\}$ converges in distribution to a vector random variable $G = (G_1, \ldots, G_N)$ if and
only if $\sum_{i=1}^N a_i X_{n,i}$ converges in distribution to $\sum_{i=1}^N a_i G_i$ for all possible choices of the
coefficients $a_1, \ldots, a_N$. Now clearly, $\sum_{i=1}^N a_i X_{n,i}$ converges in distribution, as $n \to \infty$, to
a zero-mean normal random variable with variance $\sum_{i=1}^N a_i^2$. Then, by the Cramér–Wold
device, as $n \to \infty$ the vector $X_n$ converges in distribution to $G = (G_1, \ldots, G_N)$, where
$G_1, \ldots, G_N$ are independent standard normal random variables.

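Convergence in distribution is equivalent to the fact that for any bounded continuous
function $\psi : \mathbb{R}^N \to \mathbb{R}$,

\[
  \lim_{n \to \infty} \mathbb{E}\left[ \psi(X_{n,1}, \ldots, X_{n,N}) \right] = \mathbb{E}\left[ \psi(G_1, \ldots, G_N) \right]. \tag{A.1}
\]

Consider, in particular, the function $\psi(x_1, \ldots, x_N) = \phi_L(\max_i x_i)$, where $L > 0$, and $\phi_L$ is
the “thresholding” function

\[
  \phi_L(x) =
  \begin{cases}
    -L & \text{if } x < -L, \\
    x  & \text{if } |x| \le L, \\
    L  & \text{if } x > L.
  \end{cases}
\]

Clearly, $\phi_L$ is bounded and continuous. Hence, by (A.1), we conclude that

\[
  \lim_{n \to \infty} \mathbb{E}\left[ \phi_L\left( \max_{i=1,\ldots,N} X_{n,i} \right) \right]
  = \mathbb{E}\left[ \phi_L\left( \max_{i=1,\ldots,N} G_i \right) \right].
\]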


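On the other hand, because $x \ge \phi_L(x) + (L + x)\,\mathbb{I}_{\{x < -L\}}$ for all real $x$,

\[
  \mathbb{E}\left[ \max_{i=1,\ldots,N} X_{n,i} \right]
  \ge \mathbb{E}\left[ \phi_L\left( \max_{i=1,\ldots,N} X_{n,i} \right) \right]
  + \mathbb{E}\left[ \left( L + \max_{i=1,\ldots,N} X_{n,i} \right) \mathbb{I}_{\{\max_{i=1,\ldots,N} X_{n,i} < -L\}} \right],
\]

where

\[
\begin{aligned}
  \left| \mathbb{E}\left[ \left( L + \max_{i=1,\ldots,N} X_{n,i} \right) \mathbb{I}_{\{\max_{i=1,\ldots,N} X_{n,i} < -L\}} \right] \right|
    &\le \mathbb{E}\left[ \left( \max_{i=1,\ldots,N} |X_{n,i}| - L \right) \mathbb{I}_{\{\max_{i=1,\ldots,N} |X_{n,i}| > L\}} \right] \\
    &= \int_0^\infty \mathbb{P}\left[ \max_{i=1,\ldots,N} |X_{n,i}| > L + u \right] du \\
    &= \int_L^\infty \mathbb{P}\left[ \max_{i=1,\ldots,N} |X_{n,i}| > u \right] du \\
    &\le 2N \int_L^\infty e^{-u^2/2}\, du
      \qquad \text{(by Hoeffding’s inequality; see Corollary A.1)} \\
    &\le 2N \int_L^\infty \left( 1 + \frac{1}{u^2} \right) e^{-u^2/2}\, du \\
    &= \frac{2N}{L} e^{-L^2/2}.
\end{aligned}
\]
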
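Therefore, we have, for any $L > 0$,

\[
  \liminf_{n \to \infty} \mathbb{E}\left[ \max_{i=1,\ldots,N} X_{n,i} \right]
  \ge \mathbb{E}\left[ \phi_L\left( \max_{i=1,\ldots,N} G_i \right) \right] - \frac{2N}{L} e^{-L^2/2}.
\]

Letting $L \to \infty$ on the right-hand side, and using the dominated convergence theorem, we
see that

\[
  \liminf_{n \to \infty} \mathbb{E}\left[ \max_{i=1,\ldots,N} X_{n,i} \right]
  \ge \mathbb{E}\left[ \max_{i=1,\ldots,N} G_i \right].
\]

The proof that

\[
  \limsup_{n \to \infty} \mathbb{E}\left[ \max_{i=1,\ldots,N} X_{n,i} \right]
  \le \mathbb{E}\left[ \max_{i=1,\ldots,N} G_i \right]
\]

is similar. ∎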
For a proof of the next result see, for example, Galambos [122]. 


Lemma A.12. Let $G_1, \ldots, G_N$ be independent standard normal random variables. Then

\[
  \lim_{N \to \infty} \frac{\mathbb{E}\left[ \max_{i=1,\ldots,N} G_i \right]}{\sqrt{2 \ln N}} = 1.
\]


The following lemma is a related nonasymptotic inequality for maxima of subgaussian 
random variables. 


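Lemma A.13. Let $\sigma > 0$, and let $X_1, \ldots, X_N$ be real-valued random variables such that
for all $\lambda > 0$ and $1 \le i \le N$, $\mathbb{E}\left[ e^{\lambda X_i} \right] \le e^{\lambda^2 \sigma^2/2}$. Then

\[
  \mathbb{E}\left[ \max_{i=1,\ldots,N} X_i \right] \le \sigma \sqrt{2 \ln N}.
\]

Proof. By Jensen’s inequality, for all $\lambda > 0$,

\[
  e^{\lambda \mathbb{E}[\max_{i=1,\ldots,N} X_i]}
  \le \mathbb{E}\left[ e^{\lambda \max_{i=1,\ldots,N} X_i} \right]
  = \mathbb{E}\left[ \max_{i=1,\ldots,N} e^{\lambda X_i} \right]
  \le \sum_{i=1}^N \mathbb{E}\left[ e^{\lambda X_i} \right]
  \le N e^{\lambda^2 \sigma^2/2}.
\]

Thus,

\[
  \mathbb{E}\left[ \max_{i=1,\ldots,N} X_i \right] \le \frac{\ln N}{\lambda} + \frac{\lambda \sigma^2}{2},
\]

and taking $\lambda = \sqrt{2 \ln N/\sigma^2}$ yields the result. ∎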
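To see how tight the bound of Lemma A.13 is, one can compute $\mathbb{E}[\max_i X_i]$ exactly in the setting of Lemma A.11, where the $X_i$ are $N$ independent normalized Rademacher sums; these satisfy the moment condition with $\sigma = 1$ because $\cosh(x) \le e^{x^2/2}$. The sketch below is an illustration under that assumed example, with function names of our own.

```python
import math

def expected_max_rademacher(n, N):
    """Exact E[max_i X_i] for N independent copies of X = n^{-1/2} (sum of n random signs)."""
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    values = [(2 * k - n) / math.sqrt(n) for k in range(n + 1)]
    expectation, cdf_prev = 0.0, 0.0
    for value, mass in zip(values, pmf):
        cdf = cdf_prev + mass
        # For independent copies, P[max = value] = F(value)^N - F(value-)^N.
        expectation += value * (cdf ** N - cdf_prev ** N)
        cdf_prev = cdf
    return expectation

def subgaussian_max_bound(sigma, N):
    """Bound sigma * sqrt(2 ln N) of Lemma A.13."""
    return sigma * math.sqrt(2.0 * math.log(N))

for N in (2, 10, 100, 1000):
    print(N, expected_max_rademacher(30, N), subgaussian_max_bound(1.0, N))
```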


A.1.7 Proof of Theorem 8.3 
The technique of the proof, called “chaining,” is due to Dudley [91]. 

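For each $k = 0, 1, 2, \ldots$, let $\mathcal{F}^{(k)}$ be a minimal cover of $\mathcal{F}$ of radius $D2^{-k}$. Note that
$|\mathcal{F}^{(k)}| = N_\rho(\mathcal{F}, D2^{-k})$. Denote the unique element of $\mathcal{F}^{(0)}$ by $f_0$.

Let $\Omega$ be the common domain where the random variables $T_f$, $f \in \mathcal{F}$, are defined.
Pick $\omega \in \Omega$ and let $f^* \in \mathcal{F}$ be such that $\sup_{f \in \mathcal{F}} T_f(\omega) = T_{f^*}(\omega)$. (Here we implicitly
assume that such an element exists. The modification of the proof for the general case is
straightforward.)

For each $k \ge 0$, let $f_k^*$ denote an element of $\mathcal{F}^{(k)}$ whose distance to $f^*$ is minimal.
Clearly, $\rho(f^*, f_k^*) \le D2^{-k}$, and therefore, by the triangle inequality, for each $k \ge 1$,

\[
  \rho(f_{k-1}^*, f_k^*) \le \rho(f^*, f_k^*) + \rho(f^*, f_{k-1}^*) \le 3D2^{-k}. \tag{A.2}
\]

Clearly, $\lim_{k \to \infty} f_k^* = f^*$, and so by the sample continuity of the process,

\[
  \sup_{f \in \mathcal{F}} T_f(\omega) = T_{f^*}(\omega)
  = T_{f_0}(\omega) + \sum_{k=1}^\infty \left( T_{f_k^*}(\omega) - T_{f_{k-1}^*}(\omega) \right).
\]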


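Therefore

\[
  \mathbb{E}\left[ \sup_{f \in \mathcal{F}} T_f \right] \le \sum_{k=1}^\infty \mathbb{E}\left[ \max_{f,g} \left( T_f - T_g \right) \right],
\]

where the max is taken over all pairs $(f, g) \in \mathcal{F}^{(k)} \times \mathcal{F}^{(k-1)}$ such that $\rho(f, g) \le 3D2^{-k}$.

Noting that there are at most $N_\rho(\mathcal{F}, D2^{-k})^2$ of these pairs, and recalling that $\{T_f : f \in
\mathcal{F}\}$ is subgaussian in the metric $\rho$, we can apply Lemma A.13 using (A.2). Thus, for each
$k \ge 1$,

\[
  \mathbb{E}\left[ \max_{f,g} \left( T_f - T_g \right) \right] \le 3D2^{-k} \sqrt{2 \ln N_\rho(\mathcal{F}, D2^{-k})^2}.
\]

Summing over $k$, we obtain

\[
\begin{aligned}
  \mathbb{E}\left[ \sup_{f \in \mathcal{F}} T_f \right]
    &\le \sum_{k=1}^\infty 3D2^{-k} \sqrt{2 \ln N_\rho(\mathcal{F}, D2^{-k})^2} \\
    &= 12 \sum_{k=1}^\infty D2^{-(k+1)} \sqrt{\ln N_\rho(\mathcal{F}, D2^{-k})} \\
    &\le 12 \int_0^{D/2} \sqrt{\ln N_\rho(\mathcal{F}, \varepsilon)}\, d\varepsilon,
\end{aligned}
\]

as desired. ∎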
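The last step of the proof compares a dyadic sum with an integral, using only that the covering number is nonincreasing in the radius. The sketch below illustrates this numerically for a purely hypothetical covering number $N_\rho(\mathcal{F}, \varepsilon) = (D/\varepsilon)^d$ with $d = 3$ and $D = 1$; the helper names, the truncation of the sum, and the midpoint rule for the integral are all our own assumptions.

```python
import math

def chaining_sum(log_cover, D, terms=60):
    """12 * sum_{k=1}^{terms} D 2^{-(k+1)} sqrt(ln N(D 2^{-k})), a truncation of the series."""
    return 12.0 * sum(
        D * 2.0 ** -(k + 1) * math.sqrt(log_cover(D * 2.0 ** -k))
        for k in range(1, terms + 1)
    )

def entropy_integral(log_cover, D, steps=100000):
    """Midpoint-rule approximation of 12 * integral_0^{D/2} sqrt(ln N(eps)) d eps."""
    width = (D / 2.0) / steps
    return 12.0 * width * sum(
        math.sqrt(log_cover((j + 0.5) * width)) for j in range(steps)
    )

def log_cover_power(eps, d=3, D=1.0):
    """Hypothetical log covering number ln N(eps) = d ln(D / eps), valid for eps <= D."""
    return d * math.log(D / eps)

print(chaining_sum(log_cover_power, 1.0), entropy_integral(log_cover_power, 1.0))
```

The first printed value should not exceed the second, in agreement with the final inequality of the proof.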
A.1.8 Rademacher Averages


Let $A \subset \mathbb{R}^n$ be a bounded set of vectors $a = (a_1, \ldots, a_n)$, and introduce the quantity

\[
  R_n(A) = \mathbb{E}\left[ \sup_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \right],
\]

where $\sigma_1, \ldots, \sigma_n$ are independent random variables with $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = 1/2$.
$R_n(A)$ is called the Rademacher average associated with $A$. $R_n(A)$ measures, in a sense,
the richness of set $A$.

Next we recall some of the simple structural properties of Rademacher averages. Observe
that if $A$ is symmetric in the sense that $a \in A$ implies $-a \in A$, then

\[
  R_n(A) = \mathbb{E}\left[ \sup_{a \in A} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i a_i \right| \right].
\]

Let $A, B$ be bounded symmetric subsets of $\mathbb{R}^n$ and let $c \in \mathbb{R}$ be a constant. Then the
following subadditivity properties are obvious from the definition:

\[
\begin{aligned}
  R_n(A \cup B) &\le R_n(A) + R_n(B), \\
  R_n(c \cdot A) &= |c|\, R_n(A), \\
  R_n(A \oplus B) &\le R_n(A) + R_n(B),
\end{aligned}
\]

where $c \cdot A = \{ca : a \in A\}$ and $A \oplus B = \{a + b : a \in A,\ b \in B\}$. It follows from Hoeffding’s
inequality (Lemma A.1) and Lemma A.13 that if $A = \{a^{(1)}, \ldots, a^{(N)}\} \subset \mathbb{R}^n$ is a finite
set, then

\[
  R_n(A) \le \max_{j=1,\ldots,N} \left\| a^{(j)} \right\| \frac{\sqrt{2 \log N}}{n}. \tag{A.3}
\]
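For a small finite set the Rademacher average can be evaluated exactly by enumerating the $2^n$ sign vectors, which makes it easy to compare with the right-hand side of (A.3). The following is a minimal sketch; the function names and the example set are ours.

```python
import itertools
import math

def rademacher_average(A):
    """Exact R_n(A) = E sup_{a in A} (1/n) sum_i sigma_i a_i over all 2^n sign vectors."""
    n = len(A[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * x for s, x in zip(signs, a)) for a in A)
    return total / (n * 2 ** n)

def finite_class_bound(A):
    """Right-hand side of (A.3): max_j ||a^(j)|| * sqrt(2 log N) / n."""
    n, N = len(A[0]), len(A)
    radius = max(math.sqrt(sum(x * x for x in a)) for a in A)
    return radius * math.sqrt(2.0 * math.log(N)) / n

A = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 1), (-1, 2, 0, 1)]
print(rademacher_average(A), finite_class_bound(A))
```

The enumeration costs $2^n |A|$ inner products, so this is only practical for small $n$.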
j=1,..,N n 


Finally, we mention two important properties of Rademacher averages. The first is that
if $\mathrm{absconv}(A) = \left\{ \sum_{j=1}^N c_j a^{(j)} : N \in \mathbb{N},\ \sum_{j=1}^N |c_j| \le 1,\ a^{(j)} \in A \right\}$ is the absolute convex
hull of $A$, then

\[
  R_n(A) = R_n(\mathrm{absconv}(A)),
\]

as is easily seen from the definition. The second is known as the contraction principle: let
$\phi : \mathbb{R} \to \mathbb{R}$ be a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$. Defining $\phi \circ A$ as the
set of vectors of form $(\phi(a_1), \ldots, \phi(a_n)) \in \mathbb{R}^n$ with $a \in A$, we have

\[
  R_n(\phi \circ A) \le L_\phi R_n(A)
\]


(see Ledoux and Talagrand [192]). Often it is useful to derive further upper bounds on
Rademacher averages. As an illustration we consider the case when $A$ is a subset of
$\{-1, 1\}^n$. Obviously, $|A| \le 2^n$. By inequality (A.3), the Rademacher average is bounded
in terms of the logarithm of the cardinality of $A$. This logarithm may be upper bounded in
terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^n$, then the VC
dimension of $A$ is the size $V$ of the largest set of indices $\{i_1, \ldots, i_V\} \subset \{1, \ldots, n\}$ such that
for each binary $V$-vector $b = (b_1, \ldots, b_V) \in \{-1, 1\}^V$ there exists an $a = (a_1, \ldots, a_n) \in
A$ such that $(a_{i_1}, \ldots, a_{i_V}) = b$. The key inequality establishing a relationship between
shatter coefficients and VC dimension is known as Sauer’s lemma (proved independently
by Sauer [261], Shelah [266], and Vapnik and Chervonenkis [293]) which states that the
cardinality of any set $A \subset \{-1, 1\}^n$ may be upper bounded as

\[
  |A| \le \sum_{i=0}^V \binom{n}{i} \le (n + 1)^V,
\]

where $V$ is the VC dimension of $A$. In particular, for any $A \subset \{-1, 1\}^n$,

\[
  R_n(A) \le \sqrt{\frac{2V \log(n + 1)}{n}}.
\]

This bound is a version of what has been known as the Vapnik–Chervonenkis inequality.
By a somewhat refined analysis (based on chaining, very much in the spirit of the proof
of Theorem 8.3), the logarithmic factor can be removed, and this results in a bound of the
form

\[
  R_n(A) \le C \sqrt{\frac{V}{n}}
\]

for a universal constant $C$ (Dudley [91]; see also Lugosi [206]).
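The definition of the VC dimension and Sauer’s lemma can be explored by brute force on small examples. The sketch below (our own helper names; exponential time, so only for small $n$) computes the VC dimension of a set $A \subset \{-1, 1\}^n$ and the first bound in Sauer’s lemma. The example of “threshold” vectors is an assumption made for illustration.

```python
import itertools
import math

def vc_dimension(A, n):
    """Size of the largest index set shattered by A, a collection of vectors in {-1, 1}^n."""
    best = 0
    for V in range(1, n + 1):
        shattered = any(
            len({tuple(a[i] for i in idx) for a in A}) == 2 ** V
            for idx in itertools.combinations(range(n), V)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
        best = V
    return best

def sauer_bound(n, V):
    """sum_{i=0}^{V} C(n, i), the first upper bound on |A| in Sauer's lemma."""
    return sum(math.comb(n, i) for i in range(V + 1))

n = 5
# Threshold vectors: -1 on the first j coordinates and +1 on the remaining ones.
thresholds = [tuple(1 if i >= j else -1 for i in range(n)) for j in range(n + 1)]
V = vc_dimension(thresholds, n)
print(V, len(thresholds), sauer_bound(n, V), (n + 1) ** V)
```

For the threshold vectors Sauer’s bound holds with equality, which shows that it cannot be improved in general.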


A.1.9 The Beta Distribution 
A random variable $X$ taking values in $[0, 1]$ is said to have the Beta distribution with
parameters $a, b > 0$ if its density function is given by

\[
  f(x) = \frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)},
\]

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ is the so-called Beta function. Here $\Gamma(a) =
\int_0^\infty x^{a-1} e^{-x}\, dx$ denotes Euler’s Gamma function.

Let $X$ have Beta distribution $(a, b)$ and consider a random variable $B$ such that, given
$X = x$, the conditional distribution of $B$ is binomial with parameters $n$ and $x$. Then the
marginal distribution of $B$ is calculated, for $k = 0, 1, \ldots, n$, by

\[
\begin{aligned}
  \mathbb{P}[B = k]
    &= \int_0^1 \mathbb{P}[B = k \mid X = x]\, f(x)\, dx \\
    &= \binom{n}{k} \int_0^1 \frac{x^{k+a-1} (1 - x)^{n-k+b-1}}{B(a, b)}\, dx \\
    &= \binom{n}{k} \frac{B(k + a, n - k + b)}{B(a, b)}.
\end{aligned}
\]

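Now it is easy to determine the conditional density of $X$ given $B = k$:

\[
\begin{aligned}
  f(x \mid B = k)
    &= \frac{f(x)\, \mathbb{P}[B = k \mid X = x]}{\mathbb{P}[B = k]} \\
    &= \frac{x^{a-1} (1 - x)^{b-1} \binom{n}{k} x^k (1 - x)^{n-k}}{\binom{n}{k} B(k + a, n - k + b)} \\
    &= \frac{x^{k+a-1} (1 - x)^{n-k+b-1}}{B(k + a, n - k + b)}.
\end{aligned}
\]

This is recognized as a Beta distribution with parameters $k + a$ and $n - k + b$.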
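The two formulas above translate directly into code. The sketch below evaluates the marginal distribution of $B$ through log-Gamma functions for numerical stability and returns the parameters of the conditional Beta distribution; the function names are ours.

```python
import math

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """Marginal P[B = k] = C(n, k) B(k + a, n - k + b) / B(a, b)."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def posterior_parameters(k, n, a, b):
    """Parameters (k + a, n - k + b) of the conditional Beta distribution of X given B = k."""
    return k + a, n - k + b

# With the uniform prior a = b = 1, the marginal of B is uniform on {0, ..., n}.
print([round(beta_binomial_pmf(k, 4, 1.0, 1.0), 6) for k in range(5)])
```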
A.2 Basic Information Theory


In this section we summarize some basic properties of the entropy of a discrete-valued 
random variable. For an excellent introductory book on information theory we refer to 
Cover and Thomas [74]. 

Let $X$ be a random variable taking values in the countable set $\mathcal{X}$ with distribution
$\mathbb{P}[X = x] = p(x)$, $x \in \mathcal{X}$. The entropy of $X$ is defined by

\[
  H(X) = \mathbb{E}\left[ -\log p(X) \right] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)
\]

(where log denotes natural logarithm and $0 \log 0 = 0$). If $X, Y$ is a pair of discrete random
variables taking values in $\mathcal{X} \times \mathcal{Y}$, then the joint entropy $H(X, Y)$ of $X$ and $Y$ is defined as
the entropy of the pair $(X, Y)$. The conditional entropy $H(X \mid Y)$ is defined as

\[
  H(X \mid Y) = H(X, Y) - H(Y).
\]

If we write $p(x, y) = \mathbb{P}[X = x, Y = y]$ and $p(x \mid y) = \mathbb{P}[X = x \mid Y = y]$, then

\[
  H(X \mid Y) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x \mid y),
\]

from which we see that $H(X \mid Y) \ge 0$. It is also easy to see that the defining identity of
the conditional entropy remains true conditionally, that is, for any three (discrete) random
variables $X, Y, Z$:

\[
  H(X, Y \mid Z) = H(Y \mid Z) + H(X \mid Y, Z).
\]

(Just add $H(Z)$ to both sides and use the definition of the conditional entropy.) A repeated
application of this yields the chain rule for entropy: for arbitrary discrete random variables
$X_1, \ldots, X_n$,

\[
  H(X_1, \ldots, X_n)
  = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1}).
\]
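The two expressions for the conditional entropy can be compared on a small example. In the sketch below a joint distribution is stored as a dictionary keyed by pairs $(x, y)$; this representation, the helper names, and the numbers are our own choices for illustration.

```python
import math

def entropy(pmf):
    """H = -sum p log p (natural logarithm, with 0 log 0 = 0) for a dict of probabilities."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginal pmf of coordinate `axis` of a joint pmf keyed by tuples."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

def conditional_entropy(joint):
    """H(X | Y) = -sum p(x, y) log p(x | y) for a joint pmf keyed by (x, y)."""
    p_y = marginal(joint, 1)
    return -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# Both printed numbers equal H(X | Y): the direct formula and H(X, Y) - H(Y).
print(conditional_entropy(joint), entropy(joint) - entropy(marginal(joint, 1)))
```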


Let $P$ and $Q$ be two probability distributions over a countable set $\mathcal{X}$ with probability mass
functions $p$ and $q$. Then the Kullback–Leibler divergence or relative entropy of $P$ and $Q$ is

\[
  D(P \| Q) = \sum_{x \in \mathcal{X} :\, p(x) > 0} p(x) \log \frac{p(x)}{q(x)}.
\]

Since $\log x \le x - 1$,

\[
  D(P \| Q) = -\sum_{x \in \mathcal{X} :\, p(x) > 0} p(x) \log \frac{q(x)}{p(x)}
  \ge -\sum_{x \in \mathcal{X} :\, p(x) > 0} p(x) \left( \frac{q(x)}{p(x)} - 1 \right) \ge 0.
\]

Hence, the relative entropy is always nonnegative and equals 0 if and only if $P = Q$. This
simple fact has some interesting consequences. For example, if $\mathcal{X}$ is a finite set with $N$
elements, $X$ is a random variable with distribution $P$, and we take $Q$ to be the uniform
distribution over $\mathcal{X}$, then $D(P \| Q) = \log N - H(X)$, and therefore the entropy of $X$ never
exceeds the logarithm of the cardinality of its range. Another immediate consequence of
the nonnegativity of the relative entropy is the so-called log-sum inequality, which states
that if $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$ are nonnegative numbers with $A = \sum_i a_i$ and $B = \sum_i b_i$,
then

\[
  \sum_i a_i \log \frac{a_i}{b_i} \ge A \log \frac{A}{B}.
\]

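Both consequences of nonnegativity are easy to verify numerically on finite examples. The sketch below represents distributions as lists of probabilities and assumes $q(x) > 0$ wherever $p(x) > 0$; the function names and the test vectors are ours.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x with p(x) > 0 of p(x) log(p(x) / q(x)); assumes q(x) > 0 there."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H = -sum p log p with the natural logarithm and 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def log_sum_sides(a, b):
    """Return (sum_i a_i log(a_i / b_i), A log(A / B)) for strictly positive a_i and b_i."""
    A, B = sum(a), sum(b)
    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return lhs, A * math.log(A / B)

p = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4
print(kl_divergence(p, uniform), math.log(4) - entropy(p))  # the two values agree
print(log_sum_sides([1.0, 2.0, 3.0], [2.0, 1.0, 4.0]))      # first entry >= second entry
```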
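Let now $P$ and $Q$ be distributions over an $n$-fold product space $\mathcal{X}^n$, and for $t = 1, \ldots, n$
denote by

\[
  P_t(x_1, \ldots, x_{t-1}) = \sum_{x_t, x_{t+1}, \ldots, x_n} P(x_1, \ldots, x_n)
\]

and

\[
  Q_t(x_1, \ldots, x_{t-1}) = \sum_{x_t, x_{t+1}, \ldots, x_n} Q(x_1, \ldots, x_n)
\]

the marginal distributions of the first $t - 1$ variables. Then the chain rule for relative entropy
(a straightforward consequence of the definition) states that

\[
  D(P \| Q) = \sum_{t=1}^n \sum_{x_1, \ldots, x_{t-1}} P_t(x_1, \ldots, x_{t-1})\,
  D\left( P_{x_t \mid x_1, \ldots, x_{t-1}} \,\big\|\, Q_{x_t \mid x_1, \ldots, x_{t-1}} \right),
\]

where $P_{x_t \mid x_1, \ldots, x_{t-1}}$ denotes the conditional distribution of the $t$th variable given the first
$t - 1$ variables, defined by

\[
  P_{x_t \mid x_1, \ldots, x_{t-1}}(x_t) = \frac{\sum_{x_{t+1}, \ldots, x_n} P(x_1, \ldots, x_n)}{P_t(x_1, \ldots, x_{t-1})},
\]

and $Q_{x_t \mid x_1, \ldots, x_{t-1}}$ is defined similarly.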
The following fundamental result is known as Pinsker’s inequality. For any pair of
probability distributions $P$ and $Q$,

\[
  \sqrt{\frac{1}{2} D(P \| Q)} \ge \sum_{x :\, P(x) \ge Q(x)} \left( P(x) - Q(x) \right).
\]

Sketch of Proof. First prove the inequality if $P$ and $Q$ are concentrated on the same two
atoms. Then define $A = \{x : P(x) \ge Q(x)\}$ and the measures $P^*, Q^*$ on the set $\{0, 1\}$
by $P^*(0) = 1 - P^*(1) = P(A)$ and $Q^*(0) = 1 - Q^*(1) = Q(A)$, and apply the previous
result together with $D(P \| Q) \ge D(P^* \| Q^*)$, which follows from the log-sum inequality. ∎
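Pinsker’s inequality can be checked directly on finite distributions. The sketch below computes both sides, again with distributions given as lists and under the assumption that $Q(x) > 0$ wherever $P(x) > 0$; the function names are ours.

```python
import math

def total_variation(p, q):
    """Right-hand side of Pinsker's inequality: sum over {x : P(x) >= Q(x)} of P(x) - Q(x)."""
    return sum(px - qx for px, qx in zip(p, q) if px >= qx)

def pinsker_bound(p, q):
    """Left-hand side sqrt(D(P || Q) / 2), assuming q(x) > 0 whenever p(x) > 0."""
    d = sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)
    return math.sqrt(max(d, 0.0) / 2.0)  # guard against tiny negative rounding errors

p, q = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
print(total_variation(p, q), pinsker_bound(p, q))
```

In this example the two sides are close, which illustrates that the constant $1/2$ is sharp for nearby distributions.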


A.3 Basics of Classification 


In this section we summarize some basic facts of the probabilistic theory of binary clas- 
sification. For more details we refer to Devroye, Györfi, and Lugosi [88]. The problem of 
binary classification is to guess the unknown binary class of an observation. An observation 
x is an element of a measurable space X. The unknown nature of the observation is called 
a class, denoted by y, and takes values in the set {0, 1}. 

In the probabilistic model of classification, the observation/label pair is modeled as a
pair $(X, Y)$ of random variables taking values in $\mathcal{X} \times \{0, 1\}$.

The posterior probabilities are defined, for all $x \in \mathcal{X}$, by

\[
  \eta(x) = \mathbb{P}[Y = 1 \mid X = x] = \mathbb{E}[Y \mid X = x].
\]

Thus, $\eta(x)$ is the conditional probability that $Y$ is 1, given $X = x$.
Any function $g : \mathcal{X} \to \{0, 1\}$ defines a classifier, and the value $g(x)$ represents one’s
guess of $y$ given $x$. An error occurs if $g(x) \ne y$, and the probability of error for a classifier
$g$ is

\[
  L(g) = \mathbb{P}[g(X) \ne Y].
\]
The next lemma shows that the Bayes classifier given by

\[
  g^*(x) =
  \begin{cases}
    1 & \text{if } \eta(x) \ge 1/2, \\
    0 & \text{otherwise}
  \end{cases}
\]

minimizes the probability of error. Its probability of error $L(g^*)$ is called the Bayes error.


Lemma A.14. For any classifier $g : \mathcal{X} \to \{0, 1\}$,

\[
  \mathbb{P}[g^*(X) \ne Y] \le \mathbb{P}[g(X) \ne Y].
\]

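Proof. Given $X = x$, the conditional probability of error of any decision $g$ may be
expressed as

\[
\begin{aligned}
  \mathbb{P}[g(X) \ne Y \mid X = x]
    &= 1 - \mathbb{P}[Y = g(X) \mid X = x] \\
    &= 1 - \left( \mathbb{P}[Y = 1, g(X) = 1 \mid X = x] + \mathbb{P}[Y = 0, g(X) = 0 \mid X = x] \right) \\
    &= 1 - \left( \mathbb{I}_{\{g(x) = 1\}} \mathbb{P}[Y = 1 \mid X = x] + \mathbb{I}_{\{g(x) = 0\}} \mathbb{P}[Y = 0 \mid X = x] \right) \\
    &= 1 - \left( \mathbb{I}_{\{g(x) = 1\}} \eta(x) + \mathbb{I}_{\{g(x) = 0\}} (1 - \eta(x)) \right).
\end{aligned}
\]

Thus, for every $x \in \mathcal{X}$,

\[
\begin{aligned}
  \mathbb{P}[g(X) \ne Y \mid X = x] - \mathbb{P}[g^*(X) \ne Y \mid X = x]
    &= \eta(x) \left( \mathbb{I}_{\{g^*(x) = 1\}} - \mathbb{I}_{\{g(x) = 1\}} \right)
       + (1 - \eta(x)) \left( \mathbb{I}_{\{g^*(x) = 0\}} - \mathbb{I}_{\{g(x) = 0\}} \right) \\
    &= (2\eta(x) - 1) \left( \mathbb{I}_{\{g^*(x) = 1\}} - \mathbb{I}_{\{g(x) = 1\}} \right) \\
    &\ge 0
\end{aligned}
\]

by the definition of $g^*$. The statement now follows by taking expected values of both
sides. ∎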


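Lemma A.15. The Bayes error may be written as

\[
  L(g^*) = \mathbb{P}[g^*(X) \ne Y] = \mathbb{E}\left[ \min\{\eta(X), 1 - \eta(X)\} \right].
\]

Moreover, for any classifier $g$,

\[
  L(g) - L(g^*) = 2\, \mathbb{E}\left[ \left| \eta(X) - 1/2 \right| \mathbb{I}_{\{g(X) \ne g^*(X)\}} \right].
\]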


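Proof. The proof of the previous lemma reveals that

\[
  L(g) = 1 - \mathbb{E}\left[ \mathbb{I}_{\{g(X) = 1\}} \eta(X) + \mathbb{I}_{\{g(X) = 0\}} (1 - \eta(X)) \right]
\]

and, in particular,

\[
  L(g^*) = 1 - \mathbb{E}\left[ \mathbb{I}_{\{\eta(X) \ge 1/2\}} \eta(X) + \mathbb{I}_{\{\eta(X) < 1/2\}} (1 - \eta(X)) \right].
\]

The statements are immediate consequences of these expressions. ∎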
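On a finite observation space both statements of Lemma A.15 can be verified by direct computation. The sketch below assumes that the distribution of $X$ and the posterior probabilities $\eta$ are given as dictionaries over the same finite set; this representation, the helper names, and the numbers are hypothetical choices made for illustration.

```python
def bayes_classifier(eta):
    """g*(x) = 1 if eta(x) >= 1/2 and 0 otherwise, with eta a dict x -> P[Y = 1 | X = x]."""
    return {x: 1 if e >= 0.5 else 0 for x, e in eta.items()}

def error_probability(g, mu, eta):
    """L(g) = sum_x mu(x) P[g(x) != Y | X = x], where mu is the distribution of X."""
    return sum(mu[x] * ((1 - eta[x]) if g[x] == 1 else eta[x]) for x in mu)

def excess_risk_formula(g, mu, eta):
    """2 E[|eta(X) - 1/2| I{g(X) != g*(X)}], the right-hand side of Lemma A.15."""
    g_star = bayes_classifier(eta)
    return 2 * sum(mu[x] * abs(eta[x] - 0.5) for x in mu if g[x] != g_star[x])

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.4, "c": 0.5}
g = {"a": 0, "b": 0, "c": 1}  # an arbitrary classifier that errs on "a"
l_star = error_probability(bayes_classifier(eta), mu, eta)
print(l_star, error_probability(g, mu, eta) - l_star, excess_risk_formula(g, mu, eta))
```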